flat assembler
Message board for the users of flat assembler.

Index > Main > 64bit RingBuffered Array QUEUE !!!!

Goto page Previous  1, 2, 3
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 15 Oct 2008, 02:01
r22 wrote:
Because of the latency of the LOCKed CMPXCHG opcode in the concurrent stack, I think the task list would probably outperform it until around 16+ entries. Which means as long as you have more GETters than SETters, you could probably get away with a small, fast implementation like the task list.
Every SETter should be a GETter - when SetTask fails they switch state. This ensures that frozen threads (long calculations) don't hold everything up, and also reduces the number of threads needed.

I haven't found any latency numbers for XCHG with a memory operand, nor for CMPXCHG with LOCK. Have you devised a test? From my understanding it would depend on bus speed. Intel says every locked read has a matching locked write, so CMPXCHG will write to memory even when the value has changed.

It seems like CMPXCHG would be slower - something is happening between the read and write. But it is difficult to know without testing if this "something" is even significant in the overall latency timing. In contrast the read and write for XCHG could be issued back-to-back, afaik.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
r22



Joined: 27 Dec 2004
Posts: 805
r22 16 Oct 2008, 01:36
CMPXCHG is 3 clocks for (mreg, reg) and 5 for (mem, reg). I don't know if the LOCK adds more latency even if there's no contention.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 16 Oct 2008, 02:04
XCHG takes 16 cycles according to the AMD optimization manual (Issue Date: September 2005). I guess the implicit lock prefix adds the extra latency, but still - is CMPXCHG in fact slower than XCHG on a failed attempt to set?
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 02 Dec 2008, 00:56
What Every Programmer Should Know About Memory is a very good modern overview. It even covers the problem of using CAS for every atomic operation on x86(-64) processors - CMPXCHG is hardly the answer given the flexibility available.
Azu



Joined: 16 Dec 2008
Posts: 1159
Azu 03 Nov 2009, 22:22
LocoDelAssembly wrote:
Sorry, there was an error in a comment.

I also forgot to explain what this program is testing... It simulates the scenario in which a low-priority thread manages to update the qEnqu pointer but gets immediately preempted by higher-priority threads. The higher-priority threads succeed in queuing their own data, but when they all try to dequeue an element they continually fail, because the low-priority thread hasn't set the element in the array yet, so the value of the first element is still zero. Since the high-priority threads are still consuming lots of CPU time, the low-priority thread takes a long time to execute "mov [eax], ecx" (it is below ".putElement:"). Once the low-priority thread manages to set its queue slot, the other threads complete their work immediately and the simulation finishes. It takes on average 5 seconds on my computer. By uncommenting "THREAD_PRIORITY_TIME_CRITICAL=THREAD_PRIORITY_LOWEST" the average time is reported as "0 ms" instead of around "5200 ms".

What was I trying to prove with all this? Only that if threads of heterogeneous priorities exist in the process, the queue can be mistakenly reported as empty when it is not (supposing that a failed dequeue means "empty queue"), and/or the program could stall for a while. In the case of homogeneous priorities, a failed dequeue is still possible even when several slots hold data, but the problem should last considerably less time than in the simulated case.

I'm planning to port the pseudo-code from the paper to fasm too, so we can study the pros and cons of the three versions and conduct some benchmarks (probably run by someone else ;)).

[edit]Something perhaps not so obvious until you download the attachment: I ported the code to 32-bit because I don't have a 64-bit Windows, and 32-bit is better for testing since more people can see it working on their own computers.[/edit]
What a mess! Shouldn't the thread scheduler in the OS be able to detect spinlocks and allot time appropriately when one occurs? (I know it doesn't; what I mean is, why doesn't it?)
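The window the quoted simulation exploits - the enqueue pointer is claimed, but the element is not yet published - can be sketched in C with C11 atomics. A simplified illustration only: the name `qEnqu` comes from the post, everything else (sizes, function names) is made up here, and wraparound/ABA handling is deliberately omitted:

```c
#include <stdatomic.h>
#include <stdint.h>

#define QSIZE 64
static _Atomic uint64_t slot[QSIZE];  /* 0 means "empty slot"            */
static _Atomic unsigned qEnqu;        /* next free index (name from post) */

/* Producer: reserve an index first, publish the element second.
   A preemption between the two steps leaves slot i visibly
   "claimed but empty" - the window the simulation exploits.     */
void enqueue(uint64_t value)
{
    unsigned i = atomic_fetch_add(&qEnqu, 1) % QSIZE;  /* step 1: claim   */
    /* ...a preemption here stalls every consumer of slot i...            */
    atomic_store(&slot[i], value);                     /* step 2: publish */
}

/* Consumer: a zero slot may mean "queue empty" OR "producer claimed
   the slot but was preempted before publishing" - indistinguishable. */
int dequeue(unsigned i, uint64_t *out)
{
    uint64_t v = atomic_load(&slot[i]);
    if (v == 0) return 0;        /* ambiguous failure */
    *out = v;
    atomic_store(&slot[i], 0);   /* mark the slot free again */
    return 1;
}
```

This is why a failed dequeue can't safely be read as "the queue is empty" when priorities are mixed.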

LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 03 Nov 2009, 23:08
When using the OS-provided synchronization features some actions can be taken, but in this case the OS would have to be smart enough to know what the program is doing. We humans know the program is wasting a lot of CPU time because it is waiting for a low-priority thread to finish executing a single instruction, but the OS can't know much more than "this process consumes a lot of time doing an unknown (but user-requested) task".

LocoDelAssembly wrote:
I'm planning to port the pseudo-code from the paper to fasm too, so we can study the pros and cons of the three versions and conduct some benchmarks (probably run by someone else ;)).
Hmm, what a liar...
Azu



Joined: 16 Dec 2008
Posts: 1159
Azu 03 Nov 2009, 23:18
LocoDelAssembly wrote:
When using the OS-provided synchronization features some actions can be taken, but in this case the OS would have to be smart enough to know what the program is doing. We humans know the program is wasting a lot of CPU time because it is waiting for a low-priority thread to finish executing a single instruction, but the OS can't know much more than "this process consumes a lot of time doing an unknown (but user-requested) task".
Every few million clock cycles or so, couldn't it just check EIP and see if it's changed or not? That should only add like 20 cycles overhead at the most, every million cycles or so.

Then, if it is the same, take a little look at the code to see if it's a spinlock, and if it is, let other threads have priority until the data it's spinlocking on is changed, then give it back priority.

What is hard about this?

rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 04 Nov 2009, 17:56
Spinlocks? Isn't this what SSE2's PAUSE (aka, REP NOP) is for?
Azu



Joined: 16 Dec 2008
Posts: 1159
Azu 04 Nov 2009, 18:42
Code:
@@:     pause
        cmp     dword [$], 12345678
        jne     @b
Uses 100% CPU, so apparently not.
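Right - PAUSE only relaxes the pipeline and saves some power inside the loop; the loop itself still runs flat out. To actually drop to ~0% CPU, the thread has to block in the kernel instead of spinning. A sketch of that blocking alternative using POSIX condition variables as a stand-in (on Windows the analogous tools would be an event or condition variable; all names here are made up for illustration):

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int value;   /* the word the `pause; cmp; jne` loop was watching */

/* Instead of spinning on the value and burning a core, sleep in the
   kernel until a writer signals that the value changed.              */
void wait_for(int wanted)
{
    pthread_mutex_lock(&m);
    while (value != wanted)
        pthread_cond_wait(&c, &m);  /* atomically releases m and sleeps */
    pthread_mutex_unlock(&m);
}

/* Writer side: update the value and wake any sleeping waiter. */
void publish(int v)
{
    pthread_mutex_lock(&m);
    value = v;
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
}
```

The trade-off is exactly the one discussed below: blocking costs a kernel transition, so for very short waits the spin loop is still cheaper.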

comrade



Joined: 16 Jun 2003
Posts: 1150
Location: Russian Federation
comrade 05 Nov 2009, 00:14
Azu wrote:
Every few million clock cycles or so, couldn't it just check EIP and see if it's changed or not? That should only add like 20 cycles overhead at the most, every million cycles or so.

Then, if it is the same, take a little look at the code to see if it's a spinlock, and if it is, let other threads have priority until the data it's spinlocking on is changed, then give it back priority.

What is hard about this?


1. There is no instruction to "check the EIP" of another thread. Typically the thread has to be interrupted to retrieve its context information. That is way more than 20 cycles of overhead. Orthogonal to all of this is the question of sample rate. One would have to be sampling at the Nyquist rate to accurately infer whether the EIP is changing or not. This would require one thread to run at double the frequency of another - which is impractical in how SMPs are designed.

2. How do you propose the OS "takes a look at the code", and then infers whether "the data it's spinlocking on is changed"? Spinlocks are backed by memory, and while they are being spun on, the code typically makes no accesses to the data that the spinlock is protecting.

The idea of what you have is right, and it is implemented in OSes as higher-level synchronization constructs, such as events, condition variables, semaphores, and critical sections. For example, the Windows critical section has a built-in spinlock for short critical regions, which after a few failed acquisition attempts puts the thread into a wait state, freeing the CPU to run any other ready thread. Spinlocks are a very low-level synchronization primitive, typically used by the OS kernel itself to synchronize in areas where higher-level primitives cannot be used - for example, to synchronize the thread scheduler itself! Besides those situations, spinlocks are also useful for protecting very short critical regions - for example, queued lists :) In these cases, it's cheaper to spin on a variable than it is to put a thread to sleep. As I mentioned before, the Windows critical section is aware of that phenomenon, which is why it has a built-in spinlock as its first form of synchronization.
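That spin-then-block behaviour can be sketched like this (a simplified stand-in, not the real Windows implementation: `sched_yield` substitutes for blocking on a kernel event, and the `SPIN_LIMIT` figure is arbitrary, in the spirit of `InitializeCriticalSectionAndSpinCount`):

```c
#include <stdatomic.h>
#include <sched.h>

#define SPIN_LIMIT 4000          /* arbitrary spin count for illustration */

static atomic_flag guard = ATOMIC_FLAG_INIT;

/* Spin briefly first - cheap when the critical region is short - then
   start yielding the CPU instead of burning the rest of the quantum.  */
void cs_enter(void)
{
    int spins = 0;
    while (atomic_flag_test_and_set_explicit(&guard, memory_order_acquire)) {
        if (++spins >= SPIN_LIMIT)
            sched_yield();   /* last resort: a real critical section
                                would block on a kernel event here     */
        /* else: keep spinning; ideally with a PAUSE hint in the loop  */
    }
}

void cs_leave(void)
{
    atomic_flag_clear_explicit(&guard, memory_order_release);
}
```

The spin phase covers the common short-hold case; only contended long waits pay for the scheduler round-trip.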

You should pick up a college book on operating systems; even something like Windows Internals would be very useful.

_________________
comrade (comrade64@live.com; http://comrade.ownz.com/)
Azu



Joined: 16 Dec 2008
Posts: 1159
Azu 05 Nov 2009, 00:32
comrade wrote:
Azu wrote:
Every few million clock cycles or so, couldn't it just check EIP and see if it's changed or not? That should only add like 20 cycles overhead at the most, every million cycles or so.

Then, if it is the same, take a little look at the code to see if it's a spinlock, and if it is, let other threads have priority until the data it's spinlocking on is changed, then give it back priority.

What is hard about this?


1. There is no instruction to "check the EIP" of another thread.

How does the scheduler pass control between threads if it doesn't know their EIP? I don't understand. If it can't, then how the hell do threads work?????



comrade wrote:
Typically the thread has to be interrupted to retrieve its context information. That is way more than 20 cycles of overhead.
???
It interrupts them anyway, to see if there's a higher-priority thread waiting to run. Why would it be expensive to check how much the EIP has changed during these breaks, when the thread is already interrupted?

comrade wrote:
Orthogonal to all of this, is a question of sample rate. One would have to be sampling at Nyquist rate to accurately infer whether the EIP is changing or not. This would require one thread to run at double the frequency of another - which is impractical in how SMPs are designed.
I don't understand what you mean. Why can't the scheduler check when it interrupts the thread?

comrade wrote:
2. How do you propose the OS "takes a look at the code", as well as then infer whether "the data it's spinlocking on is changed" ?
If EIP is the same after a long time, then the thread might be spinlocking, and the OS could briefly analyze the area around EIP for common spinlock constructions, like a section of memory being checked with a conditional jump right back to the cmp. If it finds one, it could set a write breakpoint on that memory and resume the thread when the memory is written to... and until then the thread can sit idle rather than taking 100% of whatever core it is on.



Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.