flat assembler
Message board for the users of flat assembler.
Index
> Main > 64bit RingBuffered Array QUEUE !!!! Goto page 1, 2, 3 Next |
Author |
|
LocoDelAssembly 02 Oct 2008, 20:31
I suggest see this http://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html (There are both ways there, lock and lock free). Very interesting and apparently it works OK.
I'm not saying that r22's code is wrong though, I'm only showing an alternative data structure. |
|||
02 Oct 2008, 20:31 |
|
Feryno 03 Oct 2008, 05:55
missing something like sub rsp,8*4 at the procedure prologue / add rsp,8*4 at procedure epilogue (+ stay with rsp aligned at dqword boundary) - to prevent the OS to destroy return address from the procedure, OS often uses 4 qwords at [rsp], [rsp+8], [rsp+16], [rsp+24] to store registers
|
|||
03 Oct 2008, 05:55 |
|
bitRAKE 05 Oct 2008, 01:19
Doesn't an XCHG with memory have an implied LOCK (i.e. LOCK prefix is not require)? It's the only instruction I've ever used for something like this.
Code: xor rax,rax .wait: xchg [ptr_flag],rax test rax,rax je .wait What is "test qword[rax],0" besides zero? Seems strange. QueueRemove will always fail. _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
05 Oct 2008, 01:19 |
|
asmfan 05 Oct 2008, 17:45
Actually you can do anything using Spinlocks but i don't think it is possible to do "simultaneous" safe changing/setting/ more than 8/16 bytes on x32/64 while addresses require 4/8 bytes hence 2 values at once.
_________________ Any offers? |
|||
05 Oct 2008, 17:45 |
|
revolution 05 Oct 2008, 22:31
From LocoDelAssembly (since he cannot post it):
The link I posted has a version that works OK, the "CAS" for us is "lock cmpxchg8b/16b" and it is used to avoid the ABA problem (a pointer plus a counter gets updated in a single atomic step). The important feature also is that there is no spinlocking, no thread awaits another (like r22's QueueAdd if I'm looking right). |
|||
05 Oct 2008, 22:31 |
|
LocoDelAssembly 08 Oct 2008, 21:43
OK, I have just finished a simple test to verify one design flaw I was expecting it to have. Check attachments.
If someone see any bugs (besides not checking for API errors ) please post your finding. Special thanks to RedGhost who provided an exception handling example which was useful for me to produce this test.
|
|||||||||||
08 Oct 2008, 21:43 |
|
LocoDelAssembly 09 Oct 2008, 00:51
Sorry, there was an error in a comment.
I also forgot to explain what is this program testing... It is simulating the scenario in which a low priority thread makes to update qEnqu pointer but gets immediately preempted by higher priority threads. The higher priority threads will succeed in queuing their own data, but when all of them will try to dequeue an element they will continually fail because the low priority thread haven't set the element on the array yet so the value of the first element is still zero. Since the high priority threads are still consuming lots of CPU time, the lower priority thread takes a lot of time to execute "mov [eax], ecx" (it is below ".putElement:"). Once the lower priority thread makes into setting its queue slot the other threads are able to complete their work inmedatelly and the simulation finishes. It takes in average 5 seconds on my computer. By uncommenting "THREAD_PRIORITY_TIME_CRITICAL=THREAD_PRIORITY_LOWEST" the average time is reported to be "0 ms" instead of around "5200 ms". What I was trying to prove with all this? Only that if threads of heterogeneous priorities exists in the process, the queue can be mistakenly reported as empty when it is not (supposing than a failed dequeue means "empty queue") and/or the program could stall for a while. And in the case of homogeneous priorities the problem of having a failed dequeue even though when there are several slots with data is still possible, but the problem should last considerably less time than in the simulated case. I'm planning to port the pseudo-codes of the paper into fasm too, so we can study the pros and cons of the three versions and conduct some benchmarks (probably made by someone else ). [edit]Something perhaps not so obvious till you download the attachment, I ported the code to 32-bit because I don't have a 64-bit Windows and 32-bit is better for the sake of testing since more people can see this working on their own computers[/edit] |
|||
09 Oct 2008, 00:51 |
|
revolution 09 Oct 2008, 01:45
I haven't seen your code but I am going to guess at how it works (I'm just feeling brave at this moment!). When ever your thread sees that the locking variable is in the "busy" state, it should use a sleep function to allow the other thread to complete it's access to the queue. This is the standard way of doing multiple thread synchronisation.
Now, if this is what you have already done, then just forget about what I said just now. Later, when I have more time to spare I will try to have a look at the code and see what is really going on. |
|||
09 Oct 2008, 01:45 |
|
LocoDelAssembly 09 Oct 2008, 03:12
The problem is that this queue is supposed to be synchronization free and, I think, that if a thread queued something, that something should be available immediately right after the queuing is complete. This is not what it is happening, a low priority thread is able to stall the dequeing of several fully queued items, not only one. With the algorithm of the paper such situations cannot happen (yet, I'll verify it anyway), because every thread is always making progress of either its own queuing or of the queuing of someone else.
With synchronization this wouldn't happen of course but using it would make it a locked queue, and that is what it was tried to avoid. Also I want to note that the no-sleep polling I'm doing with those 16 time critical threads is to simulate heavy load (for example, processing work items dequeued recently), and while those threads are working hard, others won't be able to get work items from the queue even though there are many available but the first one is still zero. Sleeping won't fix the problem because other time critical threads will still being scheduled with much more probability than the lower priority one (till aging benefits the lower priority thread or perhaps even never if the priority is IDLE). For simplification I made the "hard work" simulation and the "is queue ready to dequeue at least the item I've just queued?" all in a single peace of code. I'm not discouraging the use of this data structure though, but I'm just warning that under some situations when a thread don't fully completes the enqueue operation, it could prevent others from dequeueing fully enqueued items and also the queue could become full. |
|||
09 Oct 2008, 03:12 |
|
LocoDelAssembly 09 Oct 2008, 15:28
And following on the same problem I found a bug in the queue, the last successfully queued item can be lost if the low priority thread haven't set its queue slot before qEnque wrapped around and a a fully enqueue has been committed over the non-set-yet slot.
I'm attaching the example and this time using SuspendThread/ResumeThread because it is more exact and don't make your computer almost unresponsive for some seconds like the previous example does (But it was necessary to do it that way because that was the scenario I was trying to simulate) Comment "COMPETITOR = 1" (don't set to zero, comment it or delete it), to see that a single enqueuer don't produce this effect. Also you can comment the SuspendThread call and increase queue capacity to convince yourself that this problem only occurs in the specified scenario. Any ideas of how to solve this problem? (Again, without resorting to synchronization elements).
Last edited by LocoDelAssembly on 09 Oct 2008, 17:21; edited 1 time in total |
|||||||||||
09 Oct 2008, 15:28 |
|
revolution 09 Oct 2008, 15:37
LocoDelAssembly: I still haven't had a chance to see your code but I will ask this question: Are you using a spin lock? If so, then you probably need to provide some method for the higher priority thread to yield to a low priority thread if they are run on the same CPU core. Do you have a mechanism to check other threads to see if they are waiting? Windows does provide priority inversion but it can take quite some time for that mechanism to finally sort out the problem if you are not yielding in any other way. It is a known problem with different priority threads to cause a deadlock and MS's solution is priority inversion.
|
|||
09 Oct 2008, 15:37 |
|
LocoDelAssembly 09 Oct 2008, 15:49
revolution, see my code
There is no spin locking here, only atomic operations that are synchronization free and most of the time it works, it is only failing when the queue wraps around and qPoll points to a slot that has an update in progress by some thread. In fact this bug doesn't need a low priority thread to be triggered like the previous example, but it is more likely to occur if such low priority thread is enqueuing something. This time the simulation simulates the case when the first enqueuer gets preempted by other thread that fills the entire queue, and since the slots are considered empty when them are zero, this thread sets an element on the slot of the preempted thread and, when the preempted thread resumes execution this last element gets overwritten. Although very unlikely, this situation can happen and it is somewhat fatal, the other one was a recoverable situation somehow, but this one is not. |
|||
09 Oct 2008, 15:49 |
|
revolution 09 Oct 2008, 15:58
So use a spin lock (with sleep) then, that's what they are for. Why do you need to avoid using a lock anyway? It would seem to be no good making something that only works 99.9% of the time and then having to "hope for the best" when it is run.
LocoDelAssembly wrote: revolution, see my code |
|||
09 Oct 2008, 15:58 |
|
bitRAKE 09 Oct 2008, 16:39
QueueInit doesn't fail when it fails:
Code: ;;;INPUT ecx = size to make array stack ;;;returns NONZERO = SUCCESS 0 = FAIL QueueInit: push ecx ; bitRAKE: non-destructive of ECX shl ecx, 2 push PAGE_READONLY push MEM_COMMIT push ecx push 0 call [VirtualAlloc] test eax, eax jz .fail mov [qEnqu], eax mov [qHead], eax mov [qPoll], eax pop ecx lea eax, [eax+ecx*4-4] ; bitRAKE: non-destructive of ECX mov [qEnd], eax ret .fail: ; not eax ; bitRAKE: not needed, or should be "mov eax,0" ret Code: QueueAdd: mov eax, [qEnqu] .loop: ; advance to next item in EDX, wrap around at end lea edx, [eax+4] cmp eax, [qEnd] cmovz edx, [qHead] cmp dword [eax], 0 ; Do we have the slot free? jne .fail lock cmpxchg [qEnqu], edx jnz .loop .putElement: mov [eax], ecx ret .fail: xor eax, eax ret Code: ;;;return data addr = success 0 = failure QueueRemove: xor ecx, ecx mov eax, [qPoll] .loop: ; advance to next item in EDX, wrap around at end lea edx, [eax+4] cmp eax, [qEnd] cmovz edx, [qHead] cmp dword [eax], 0 je .fail lock cmpxchg [qPoll], edx jnz .loop xchg eax, ecx xchg [ecx], eax ret .fail: xchg eax, ecx ret _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup Last edited by bitRAKE on 09 Oct 2008, 17:03; edited 1 time in total |
|||
09 Oct 2008, 16:39 |
|
LocoDelAssembly 09 Oct 2008, 17:03
bitRAKE wrote:
Damn, I don't know how I managed to think that non-zero means failure and success at the same time About cmovz, I haven't used it because I raise the minimum CPU requirement to PPro with that but the 64-bit version should use it definitively. Thanks bitRAKE |
|||
09 Oct 2008, 17:03 |
|
bitRAKE 09 Oct 2008, 17:24
Looks like you are accessing the thread handles with one based indexing instead of zero based.
Code: invoke WaitForSingleObject, [hThreads+ebx*4], -1 Code: invoke WaitForSingleObject, [hThreads+4*(ebx-1)], -1 Code: invoke WaitForMultipleObjects, THREADS, hThreads, TRUE, -1 _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
09 Oct 2008, 17:24 |
|
LocoDelAssembly 09 Oct 2008, 17:46
hehe right, but note that I have made the very same mistake in the creation loop so I'm waiting for real handlers anyway (I'm using from hThreads[1] to buff[0..3]).
About WaitForMultipleObjects, I avoided it by I don't remember what silly reason (maybe I was hurried to read the documentation to make sure I use it properly). Anyway, the wait loop was added just to make sure the message box won't lie, but actually when the human eye is able to see the message box, all the other threads have executed ExitThread with a huge level of certainty. (The waiter is the very same thread that gets preempted at QueueAdd.putElement) |
|||
09 Oct 2008, 17:46 |
|
bitRAKE 10 Oct 2008, 00:44
Lock-free stack implementation from a discussion on Google Groups:
Code: ; Lock-Free Queue Push ; extern void acLfQueuePush( ac_LfQueue* pQueue, ac_LfNode* pNode ); acLfQueuePush PROC ; Preserve push edi push esi push ebx ; Load the back mov esi, [esp + 16] mov edx, [esi + 12] mov edi, [esi + 8] acLfQueuePush_Retry: ; Prepare the Cas mov eax, 0 mov ebx, [esp + 20] lea ecx, [edx + 1] ; Try to link the new node cmpxchg [edi], ebx je acLfQueuePush_Success ; Prepare the Cas mov ebx, eax mov eax, edi ; Try to advance the back to the next cmpxchg8b qword ptr [esi + 8] mov edi, eax jmp acLfQueuePush_Retry acLfQueuePush_Success: ; Prepare the Cas mov eax, edi ; Try to point the back to the new node cmpxchg8b qword ptr [esi + 8] ; Restore pop ebx pop esi pop edi ret acLfQueuePush ENDP ; Lock-Free Queue Pop ; extern ac_LfNode* acLfQueuePop( ac_LfQueue* pQueue ); acLfQueuePop PROC ; Preserve push edi push esi push ebx mov esi, [esp + 16] acLfQueuePop_Retry: ; Load the front mov edx, [esi + 4] mov edi, [esi] ; Load the back mov ecx, [esi + 12] mov ebx, [esi + 8] ; Load the next mov eax, [edi] ; Validate test eax, eax je acLfQueuePop_Fail cmp edi, [esi] jne acLfQueuePop_Retry cmp edx, [esi + 4] jne acLfQueuePop_Retry cmp edi, [esi + 8] jne acLfQueuePop_TryPop ; Prepare the Cas mov eax, edi mov edx, ebx mov ebx, eax lea ecx, [edx + 1] ; Try to advance the back to the next cmpxchg8b qword ptr [esi + 8] jmp acLfQueuePop_Retry acLfQueuePop_TryPop: ; Load the object mov ecx, [eax + 12] ; Prepare the Cas mov ebx, eax mov eax, edi mov edi, ecx lea ecx, [edx + 1] ; Try to unlink the next node cmpxchg8b qword ptr [esi] jne acLfQueuePop_Retry ; Set the object on the popped node mov [eax + 12], edi acLfQueuePop_Fail: ; Restore pop ebx pop esi pop edi ret acLfQueuePop ENDP _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
10 Oct 2008, 00:44 |
|
nop 10 Oct 2008, 01:32
i feel so inadequate cos i have no idea what any of this stuff is for, or how it does it i'm just too simple i ges
|
|||
10 Oct 2008, 01:32 |
|
Goto page 1, 2, 3 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.