flat assembler
Message board for the users of flat assembler.

Index > Heap > 86 Mac Plus Vs. 07 AMD DualCore. You Won't Believe Who Wins


What do you think of this?
Wow, this blows my mind!!!  8%  [ 1 ]
Interesting, I always suspected this ...  66%  [ 8 ]
Bah, they're stretching the truth too far!  8%  [ 1 ]
No big deal, who cares?!  16%  [ 2 ]
Total Votes : 12

kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
Quote:
kohlrak: I advise you to do a little research on How Stuff Works™ before you make statements like that

You could read up on bus widths, the ever-increasing data rates and how RAM has evolved (edo, sd, ddr{1,2,3}, rambus, ...), PCI and PCI-e bus mastering, DMA transfers, the AGP hack, et cetera.

But sure thing, there's a lot of situations where RAM is the bottleneck. Thus, read up on caching, techniques like strip-mining, et cetera.


Usually I just stick to the docs I've been handed. The only docs I have received on the processor were the Art of Assembly's material on architecture. It always led me to believe that the whole process is one big ugly argument over resources.

Quote:
Problem is that each CPU does need to be able to access *all* the memory in the system, so even though you might have two RAM buses with separate memory (google for NUMA), you still need some interconnect. This is somewhat complicated to program for, since you now have the notion of "fast" and "slow" ram, not to mention that the CPU caches need to be synchronized.


As well and separate RAM for each CPU. You could split your program to run on the separate CPUs, which could be synced with the cards and hardware, which isn't as important speed-wise as the memory.

Quote:
A lot of people have flamed the core2 architecture for using a shared cache scheme instead of separate-per-core cache, but imho the scheme is pretty smart, at least theoretically (I haven't looked at how the implementation affects things).


And what's the theory?
Post 04 Jul 2007, 05:14
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
kohlrak wrote:
Usually I just stick to the docs I've been handed.

Then you've been reading very flawed documents, or you've misunderstood them severely Smile

kohlrak wrote:
As well and separate RAM for each CPU. You could split your program to run on the separate CPUs, which could be synced with the cards and hardware, which isn't as important speed-wise as the memory.

Umm... come again?

kohlrak wrote:

f0dder wrote:

A lot of people have flamed the core2 architecture for using a shared cache scheme instead of separate-per-core cache, but imho the scheme is pretty smart, at least theoretically (I haven't looked at how the implementation affects things).

And what's the theory?

You'll often need to access some of the same data in several threads. This could be lookup tables or other data structures, it could be multiple threads running the same code, and it could be OS data structures and code.

Instead of duplicating this data in the cache of each core, you have it once in shared cache, leaving more free cache for non-shared data... and you get rid of some syncing overhead in the process.
Post 04 Jul 2007, 08:01
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
Quote:
Then you've been reading very flawed documents, or you've misunderstood them severely


Or maybe they barely touch the subject, but give you enough information to think they covered everything.

Quote:
Umm... come again?


In other words, you could implement a system where each core has its own set of RAM, plus a main shared set. The two threads (in the situation where you would use two cores) could then be split between the two processors, one per thread. Of course, you would have to handle the case where more than two threads are running, but with only two threads they can be easily separated, and the CPUs only sync when they access "global data." Then you don't have to worry as much about the overhead of virtual threading, and the cores don't have to share the same bus. One of the two can do all the work of managing the resources (throwing around the graphics, sound, and other I/O data) as needed, while the other handles the remaining calculations, so you're not doing both in the same process. That would leave more work on the programmer to use his or her intelligence to use the system effectively. It's like the dual-core thing, only now the programmer has more control, and you have more than one memory line (2 or 3, depending on how the manufacturer would plan on implementing such a thing).

Quote:
You'll often need to access some of the same data in several threads. This could be lookup tables or other data structures, it could be multiple threads running the same code, and it could be OS data structures and code.

Instead of duplicating this data in the cache of each core, you have it once in shared cache, leaving more free cache for non-shared data... and you get rid of some syncing overhead in the process.


And the advantage all depends on how often you'd require syncing for your algorithm. If you seldom sync, you're just restricting yourself. Though, the concept would be a lot cheaper than mine...
Post 04 Jul 2007, 12:02
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Hm, both per-CPU and shared memory? So, do you load all OS code and data into both shared memory and each CPU's memory, or do you keep it just in shared memory? Whatever you do, you need syncing.

If the per-cpu memory is only available to the individual CPU, to share data you'd need to copy to the shared memory, and possibly back again - might as well keep shared content only in the shared memory, then.

Sounds smarter to me to keep all memory 'shared', but possibly with the sense of "fast" and "slow" (or "near" and "far") resources. Check out NUMA for some starting points.

Of course it'd still be nicest with a single memory pool that was fast enough to saturate all CPUs, without any special coding, but that's probably not too feasible with 4- and 8-core machines, and whatever more the future brings.

kohlrak wrote:

And the advantage all depends on how often you'd require syncing for your algorithem. If you seldom sync, you're just restricting yourself. Though, the concept would be alot cheaper than mine...

The syncing I'm talking about isn't at algorithmic/logical level, it's hardware level for things like cache coherency, which is necessary. Unless you want the software to become pretty complex.

There's other things you could do, sure, which work in more specialized systems... like the "drone" processors in the PS3, which afaik don't have direct access to the RAM; instead the master processor feeds data into their (limited but super-fast) memory. This seems doable for SIMD operations on large streams of data, but probably not too comfortable for general-purpose code like what runs on x86... playstations have always been tricky to program, but able to do pretty neat tricks once the programmers learned the dos and don'ts.

(My knowledge of the PS3 is pretty superficial, so I could be wrong about the above paragraph.)
Post 04 Jul 2007, 23:35
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
Quote:
Hm, both per-cpu as well as shared memory? So, do you load all OS code and data into both shared memory and each CPU memory, or do you keep it just in shared memory? Whatever you do, you need syncing.


Loading would need the syncing, but once it's loaded you could just execute away. Programs typically take a long time to start up anyway, so it's not really noticeable whether that part is fast or slow, but once it's loaded it could execute faster. You can load the parts of the OS you're using into the individual sections of memory.
Post 05 Jul 2007, 19:06
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
I'll let somebody else comment on that. But one sentence: it's not because of RAM speeds that loading is slow.
Post 06 Jul 2007, 10:43
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
You'll have to pardon me, for I was... half asleep when I typed all that. I'm not saying that loading is slow because of RAM; I'm saying that you can take advantage of slow loading, because the user won't be able to tell the difference, to load it into separate RAM for each CPU. Though, yes, it'd complicate things quite a bit, it would still improve speed.
Post 06 Jul 2007, 14:48
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
I give up.
Post 06 Jul 2007, 14:56
DustWolf



Joined: 26 Jan 2006
Posts: 373
Location: Ljubljana, Slovenia
f0dder wrote:
Hm, both per-cpu as well as shared memory? So, do you load all OS code and data into both shared memory and each CPU memory, or do you keep it just in shared memory? Whatever you do, you need syncing.


Just wondering: what exactly is the magical universal mechanism used in OSes to divide tasks amongst CPUs and keep their threads from causing unexpected async problems?

Is there one? I mean, I thought this was up to the programmer, and a really good method is something they just haven't come up with yet.

Sometimes the method in which this stuff is done is kind of determined by the hardware, since some hardware is typical in some situations and it uses a lot of CPU through the ISRs it uses to communicate with the more general OS.

For example, I saw some really good multi-CPU server architectures, where one CPU is connected to the networking hardware while the other is connected to the harddisk hardware; each has a couple GB of RAM, and they have a fast bus for interprocessor communication which allows them to access each other's RAM. The gigabit NICs then have a CPU of their own, dedicated to doing their checksumming and other ISR work in the hardware itself, whereas the other CPU can dedicate itself to more generic OS operations. That kind of trick gives no special advantage to generic software unless it's been programmed right (OS included), but it does do networking a whole lot better than any other dual-CPU system, which is important for a network server / major internet gateway.
Post 06 Jul 2007, 18:34
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
DustWolf wrote:

Just wondering what exactly is the magical universal mechanism used in OSs to divide tasks amongst CPUs and keep their threads from causing unexpected async problems?

Unexpected async problems at the application level is up to, ta-da, the applications, and (where applicable) libraries (consider kernel32, user32 and friends "libraries" in this context).

At the system level, you have to design your kernel to be pre-emptive (or do some very ugly and coarse locks, which will damage scalability) - protect kernel structures with mutexes, use lock-free algorithms where you can, et cetera.

DustWolf wrote:

For example, I saw some really good multi-CPU server architectures, where one CPU is connected to the networking hardware while the other is connected to the harddisk hardware; each has a couple GB of RAM, and they have a fast bus for interprocessor communication which allows them to access each other's RAM.

Humm, if we're talking x86, it would seem a waste to me to dedicate an entire CPU *just* to network or harddisk... even gigabit networking doesn't place too much strain on a core, and neither does 100 MB/sec harddisk throughput (unless you're in PIO mode Wink ).

But if you know you're designing for specific usage patterns, you can probably do a specialized scheduler (or perhaps even a generic scheduler with some specialized priorities and processor affinity masks) that is more optimal than "whatever default scheme".

DustWolf wrote:

The gigabit NICs then have a CPU of their own, dedicated to doing their checksumming and other ISRs by the hardware itself, whereas the other CPU can dedicate itself to more generic OS operations.

These days, even cheap onboard NICs tend to have various levels of acceleration, at the very least checksum generation. Accelerating tasks in custom hardware instead of using 100% generic CPUs can be advantageous; just think of how much C=64 and Amiga systems accomplished although their CPUs were pretty puny, simply because they had other chips for other tasks.
Post 07 Jul 2007, 22:21
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
Quote:

even gigabit networking doesn't place too much strain on a core,

True for 1 gbps, false for 10 gbps. When the NIC is 10 gbps, it is recommended to switch to a polling scheme rather than an interrupt scheme. At 10 gbps there are lots of frames per second, and even with a dedicated core you probably can't use the entire bandwidth capacity; 10 gbps is really a lot.

PS: I heard that even with 1 gbps you could still need a separate core, because switching from one NIC to another can't be done at full speed otherwise, but this is just something I heard somewhere else...
Post 07 Jul 2007, 22:44
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Ah, 10gbit is different from "just" gigabit (which didn't place too much strain on my system).

I don't have experience with connections faster than 1 gbit, so I can't really speak much about them. But I can see that even for my onboard 1 gbit there's an option to modify the "max interrupts per second" rate - I expect you can get less CPU strain by lowering this, at the cost of some latency. But for a 10 gbit connection, what's most important? Low latency or high throughput? Smile

Also, frame size would have something to say... "jumbo frames" are at the link layer, right?

If we start talking about routing at 10gbit and beyond, I think custom hardware is better suited than x86 anyway, but I'll happily admit that I've never designed such a system Smile
Post 07 Jul 2007, 23:09
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
With 10 gbps, how many jumbo frames (9000 bytes) can we have? Would you say more than 100,000 frames? I think that's a lot of interrupts, and with polling there is very little chance of finding the NIC's buffer empty at such a rate. As for interrupting after some interval or when the NIC's buffer is full (whichever comes first), there is still the problem that TCP/IP has to wait for confirmation, and with such high-speed NICs the TCP window fills terribly fast, so here we're in a situation where we need low latency and high throughput at the same time.

With 1 gbps switching there are more than 10,000*4 jumbo frames (10,000*2 interrupts, if the NIC interrupts when a frame has been received) per second. And let's forget the interrupts for a moment: if you allow the core to run threads other than the one that handles the NICs, you are "poisoning" the CPU cache and hence reducing the switching performance.

Quote:

Ah, 10gbit is different from "just" gigabit (which didn't place too much strain on my system).


I tried 1 gbps once here with point-to-point UTP wiring and I didn't notice 100% CPU utilization either, but the speed was just 25-35 MB/s, so something else is limiting the speed (it was not the HDD, because both computers can handle a sustained 60 MB/s, and the second try was cached anyway). Anyway, that's my story. Did you get rates very near 100 MB/s? Razz
Post 08 Jul 2007, 00:23
DustWolf



Joined: 26 Jan 2006
Posts: 373
Location: Ljubljana, Slovenia
f0dder wrote:
DustWolf wrote:

For example, I saw some really good multi-CPU server architectures, where one CPU is connected to the networking hardware while the other is connected to the harddisk hardware; each has a couple GB of RAM, and they have a fast bus for interprocessor communication which allows them to access each other's RAM.

Humm, if we're talking x86, it would seem a waste to me to dedicate an entire CPU *just* to network or harddisk... even gigabit networking doesn't place too much strain on a core, and neither does 100 MB/sec harddisk throughput (unless you're in PIO mode Wink ).


Well, I didn't mean dedicated as in having all of the CPU just for that task, but rather dedicated as in having all of the interrupts executed explicitly on that CPU.

As I understand it, if the interrupt hardware lines are plugged into one CPU, the ISRs will execute on that CPU and no other. And, regardless of everything else, that would make the ISR-based execution at least work better than in a classic single-CPU, multi-core setting. The rest would seem to be up to the OS scheduler.

And yes, I meant x86... actually I was referring to this (mind the spam...):
Image
Post 08 Jul 2007, 11:01
DustWolf



Joined: 26 Jan 2006
Posts: 373
Location: Ljubljana, Slovenia
LocoDelAssembly wrote:
I tried 1 gbps once here with point-to-point UTP wiring and I didn't notice 100% CPU utilization either, but the speed was just 25-35 MB/s, so something else is limiting the speed (it was not the HDD, because both computers can handle a sustained 60 MB/s, and the second try was cached anyway). Anyway, that's my story. Did you get rates very near 100 MB/s? Razz


Just wondering: do ISRs even show up on the method of CPU usage measurement you're using? As I recall, some operating systems use a brilliant reverse scheme that masks away any CPU usage from ISRs, CPU power-saving procedures, and other under-the-hood processes.

If we're speaking Windows here, I think there are also some other OS-enforced bottlenecks. Not counting the QoS service that purposely limits bandwidth usage, there are also timed pass-through mechanisms that handle networking. If my understanding is correct, by default settings they have rather low priority, which means the CPU comes around to handle them rarely, regardless of CPU usage.

There is then also the issue of retransmission, where over noisy data lines the NICs and switches will automatically retransmit the data, hence lowering the overall efficiency (with "checksum offload" enabled, the processing for this happens inside the NICs themselves and won't show up anywhere, not to mention that any network switches in between do this automatically without providing an OFF option).

If it's Windows, try the same bandwidth test on Linux; I have both machine types here, and the Linux boxes usually do the transmissions about twice as efficiently as the Windows ones.
Post 08 Jul 2007, 11:18

Copyright © 1999-2020, Tomasz Grysztar.

Powered by rwasa.