flat assembler
Message board for the users of flat assembler.

Index > Main > 64bit RingBuffered Array QUEUE !!!!

Goto page 1, 2, 3  Next
Author
Thread Post new topic Reply to topic
r22



Joined: 27 Dec 2004
Posts: 805
r22
Using the LOCK prefix and CMPXCHG I've created a simple concurrent queue which uses a circular buffer.

Code:
.data
qEnqu dq 0
qPoll   dq 0
qHead dq 0
qEnd   dq 0

.code
;;;INPUT rcx = size to make array stack
;;;returns NONZERO = SUCCESS  0 = FAIL
QueueInit:
        shl     rcx,3 
        mov     rdx,rcx ;;; size to qword 
        xor     ecx,ecx 
        push    rdx ;;; save 
        mov     r8d,MEM_COMMIT or MEM_RESERVE 
        mov     r9d,PAGE_READWRITE 
        call    [VirtualAlloc] 
        test    rax,rax 
        jz      .fail 
        mov     qword[qEnqu],rax
        mov     qword[qHead],rax
        mov     qword[qPoll],rax
        pop     rcx
        lea     rax,[rax+rcx-8]
        mov     qword[qEnd],rax
        ret     0 
   .fail: 
        xor     eax,eax 
        invoke  ExitProcess,0 
        ret     0 

;;;return NONZERO = SUCCESS  0 = FAIL 
QueueDelete:
        mov     rcx,qword[qHead]
        xor     edx,edx 
        mov     r8d,MEM_RELEASE 
        call    [VirtualFree] 
        test    rax,rax 
        jz      .fail 
        ret     0 
   .fail: 
        xor     eax,eax 
        ret     0

;;;RCX = addr of data to put to the stack
;;;return NONZERO = success  0 = fail
QueueAdd:
        mov     rax,qword[qEnqu]
   .loop:
        lea     rdx,[rax+8]
        test    qword[rax],0 ;;;COMMENT OUT IF you want the queue
        jnz     .fail        ;;;to overwrite unPolled data
        cmp     rax,qword[qEnd]
        jnz     .skip
        mov     rdx,qword[qHead]
   .skip:
   lock cmpxchg qword[qEnqu],rdx
        jnz     .loop
        mov     qword[rax],rcx
        ret     0
   .fail:
        xor     eax,eax
        ret     0

;;;return data addr = success  0 = failure
QueueRemove:
        xor     ecx,ecx
        mov     rax,qword[qPoll]
   .loop:
        lea     rdx,[rax+8]
        test    qword[rax],0
        jz      .fail
        cmp     rax,qword[qEnd]
        jnz     .skip
        mov     rdx,qword[qHead]
   .skip:
   lock cmpxchg qword[qPoll],rdx
        jnz     .loop
        mov     rcx,rax
        xor     eax,eax
   lock xchg    qword[rcx],rax
        ret     0
   .fail:
        xor     eax,eax
        ret     0
    


If someone wants to make a circular array queue that uses a Critical Section or Mutex we can make an overly elaborate multithreaded benchmark and learn which Concurrency method is superior.

I'm open to any ideas in optimizing the above code OR bug fixes :P
Currently it uses test for null to make sure it hasn't overlaped itself, which works fine since the Queue takes data addresses which (for the most part) can't be 0000000000000000h.
Post 23 Sep 2006, 00:20
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
I suggest see this http://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html (There are both ways there, lock and lock free). Very interesting and apparently it works OK.

I'm not saying that r22's code is wrong though, I'm only showing an alternative data structure.
Post 02 Oct 2008, 20:31
View user's profile Send private message Reply with quote
Feryno



Joined: 23 Mar 2005
Posts: 454
Location: Czech republic, Slovak republic
Feryno
missing something like sub rsp,8*4 at the procedure prologue / add rsp,8*4 at procedure epilogue (+ stay with rsp aligned at dqword boundary) - to prevent the OS to destroy return address from the procedure, OS often uses 4 qwords at [rsp], [rsp+8], [rsp+16], [rsp+24] to store registers
Post 03 Oct 2008, 05:55
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2887
Location: [RSP+8*5]
bitRAKE
Doesn't an XCHG with memory have an implied LOCK (i.e. LOCK prefix is not require)? It's the only instruction I've ever used for something like this.
Code:
       xor rax,rax
.wait: xchg [ptr_flag],rax
       test rax,rax
       je .wait    
...additionally, the lower bits could be used for flags if all pointers are aligned (changing the TEST to mask off those bits). This method could be used to limit access to the buffer pointer, and/or on the buffer items themselves (less waiting, imho).

What is "test qword[rax],0" besides zero? Seems strange.
QueueRemove will always fail.

_________________
¯\(°_o)/¯ unlicense.org
Post 05 Oct 2008, 01:19
View user's profile Send private message Visit poster's website Reply with quote
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan
Actually you can do anything using Spinlocks but i don't think it is possible to do "simultaneous" safe changing/setting/ more than 8/16 bytes on x32/64 while addresses require 4/8 bytes hence 2 values at once.

_________________
Any offers?
Post 05 Oct 2008, 17:45
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17247
Location: In your JS exploiting you and your system
revolution
From LocoDelAssembly (since he cannot post it):

The link I posted has a version that works OK, the "CAS" for us is "lock cmpxchg8b/16b" and it is used to avoid the ABA problem (a pointer plus a counter gets updated in a single atomic step). The important feature also is that there is no spinlocking, no thread awaits another (like r22's QueueAdd if I'm looking right).
Post 05 Oct 2008, 22:31
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
OK, I have just finished a simple test to verify one design flaw I was expecting it to have. Check attachments.

If someone see any bugs (besides not checking for API errors Wink) please post your finding.

Special thanks to RedGhost who provided an exception handling example which was useful for me to produce this test.


Description:
Download
Filename: queues.zip
Filesize: 2.93 KB
Downloaded: 243 Time(s)

Post 08 Oct 2008, 21:43
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Sorry, there was an error in a comment.

I also forgot to explain what is this program testing... It is simulating the scenario in which a low priority thread makes to update qEnqu pointer but gets immediately preempted by higher priority threads. The higher priority threads will succeed in queuing their own data, but when all of them will try to dequeue an element they will continually fail because the low priority thread haven't set the element on the array yet so the value of the first element is still zero. Since the high priority threads are still consuming lots of CPU time, the lower priority thread takes a lot of time to execute "mov [eax], ecx" (it is below ".putElement:"). Once the lower priority thread makes into setting its queue slot the other threads are able to complete their work inmedatelly and the simulation finishes. It takes in average 5 seconds on my computer. By uncommenting "THREAD_PRIORITY_TIME_CRITICAL=THREAD_PRIORITY_LOWEST" the average time is reported to be "0 ms" instead of around "5200 ms".

What I was trying to prove with all this? Only that if threads of heterogeneous priorities exists in the process, the queue can be mistakenly reported as empty when it is not (supposing than a failed dequeue means "empty queue") and/or the program could stall for a while. And in the case of homogeneous priorities the problem of having a failed dequeue even though when there are several slots with data is still possible, but the problem should last considerably less time than in the simulated case.

I'm planning to port the pseudo-codes of the paper into fasm too, so we can study the pros and cons of the three versions and conduct some benchmarks (probably made by someone else Wink).

[edit]Something perhaps not so obvious till you download the attachment, I ported the code to 32-bit because I don't have a 64-bit Windows and 32-bit is better for the sake of testing since more people can see this working on their own computers[/edit]
Post 09 Oct 2008, 00:51
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17247
Location: In your JS exploiting you and your system
revolution
I haven't seen your code but I am going to guess at how it works (I'm just feeling brave at this moment!). When ever your thread sees that the locking variable is in the "busy" state, it should use a sleep function to allow the other thread to complete it's access to the queue. This is the standard way of doing multiple thread synchronisation.

Now, if this is what you have already done, then just forget about what I said just now. Razz

Later, when I have more time to spare I will try to have a look at the code and see what is really going on.
Post 09 Oct 2008, 01:45
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
The problem is that this queue is supposed to be synchronization free and, I think, that if a thread queued something, that something should be available immediately right after the queuing is complete. This is not what it is happening, a low priority thread is able to stall the dequeing of several fully queued items, not only one. With the algorithm of the paper such situations cannot happen (yet, I'll verify it anyway), because every thread is always making progress of either its own queuing or of the queuing of someone else.

With synchronization this wouldn't happen of course but using it would make it a locked queue, and that is what it was tried to avoid. Also I want to note that the no-sleep polling I'm doing with those 16 time critical threads is to simulate heavy load (for example, processing work items dequeued recently), and while those threads are working hard, others won't be able to get work items from the queue even though there are many available but the first one is still zero. Sleeping won't fix the problem because other time critical threads will still being scheduled with much more probability than the lower priority one (till aging benefits the lower priority thread or perhaps even never if the priority is IDLE).

For simplification I made the "hard work" simulation and the "is queue ready to dequeue at least the item I've just queued?" all in a single peace of code.

I'm not discouraging the use of this data structure though, but I'm just warning that under some situations when a thread don't fully completes the enqueue operation, it could prevent others from dequeueing fully enqueued items and also the queue could become full.
Post 09 Oct 2008, 03:12
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
And following on the same problem I found a bug in the queue, the last successfully queued item can be lost if the low priority thread haven't set its queue slot before qEnque wrapped around and a a fully enqueue has been committed over the non-set-yet slot.

I'm attaching the example and this time using SuspendThread/ResumeThread because it is more exact and don't make your computer almost unresponsive for some seconds like the previous example does Razz (But it was necessary to do it that way because that was the scenario I was trying to simulate)

Comment "COMPETITOR = 1" (don't set to zero, comment it or delete it), to see that a single enqueuer don't produce this effect. Also you can comment the SuspendThread call and increase queue capacity to convince yourself that this problem only occurs in the specified scenario.

Any ideas of how to solve this problem? (Again, without resorting to synchronization elements).


Description:
Download
Filename: queue_bug.zip
Filesize: 2.94 KB
Downloaded: 221 Time(s)



Last edited by LocoDelAssembly on 09 Oct 2008, 17:21; edited 1 time in total
Post 09 Oct 2008, 15:28
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17247
Location: In your JS exploiting you and your system
revolution
LocoDelAssembly: I still haven't had a chance to see your code but I will ask this question: Are you using a spin lock? If so, then you probably need to provide some method for the higher priority thread to yield to a low priority thread if they are run on the same CPU core. Do you have a mechanism to check other threads to see if they are waiting? Windows does provide priority inversion but it can take quite some time for that mechanism to finally sort out the problem if you are not yielding in any other way. It is a known problem with different priority threads to cause a deadlock and MS's solution is priority inversion.
Post 09 Oct 2008, 15:37
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
revolution, see my code Smile

There is no spin locking here, only atomic operations that are synchronization free and most of the time it works, it is only failing when the queue wraps around and qPoll points to a slot that has an update in progress by some thread. In fact this bug doesn't need a low priority thread to be triggered like the previous example, but it is more likely to occur if such low priority thread is enqueuing something. This time the simulation simulates the case when the first enqueuer gets preempted by other thread that fills the entire queue, and since the slots are considered empty when them are zero, this thread sets an element on the slot of the preempted thread and, when the preempted thread resumes execution this last element gets overwritten.

Although very unlikely, this situation can happen and it is somewhat fatal, the other one was a recoverable situation somehow, but this one is not.
Post 09 Oct 2008, 15:49
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17247
Location: In your JS exploiting you and your system
revolution
So use a spin lock (with sleep) then, that's what they are for. Why do you need to avoid using a lock anyway? It would seem to be no good making something that only works 99.9% of the time and then having to "hope for the best" when it is run.
LocoDelAssembly wrote:
revolution, see my code
Oh man, I am so deep into another project that right now I can't afford to break my train of thought. Perhaps in a day or two I can be free to peruse through the code.
Post 09 Oct 2008, 15:58
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2887
Location: [RSP+8*5]
bitRAKE
QueueInit doesn't fail when it fails:
Code:
;;;INPUT ecx = size to make array stack
;;;returns NONZERO = SUCCESS  0 = FAIL
QueueInit:
        push    ecx                ; bitRAKE: non-destructive of ECX
        shl     ecx, 2
        push    PAGE_READONLY
        push    MEM_COMMIT
        push    ecx
        push    0
        call    [VirtualAlloc] 
        test    eax, eax
        jz      .fail 
        mov     [qEnqu], eax
        mov     [qHead], eax
        mov     [qPoll], eax
        pop     ecx
        lea     eax, [eax+ecx*4-4] ; bitRAKE: non-destructive of ECX
        mov     [qEnd], eax
        ret
   .fail:
;        not     eax  ; bitRAKE: not needed, or should be "mov eax,0"
        ret    
The rest looks good. Using CMOVZ would be another option:
Code:
QueueAdd:
        mov     eax, [qEnqu]
   .loop:
        ; advance to next item in EDX, wrap around at end
        lea     edx, [eax+4]
        cmp     eax, [qEnd]
        cmovz   edx, [qHead]

        cmp     dword [eax], 0 ; Do we have the slot free?
        jne     .fail
   lock cmpxchg [qEnqu], edx
        jnz     .loop

.putElement:
        mov     [eax], ecx
        ret

   .fail:
        xor     eax, eax
        ret    
Maybe, you have intended to use ECX in QueueRemove similar to:
Code:
;;;return data addr = success  0 = failure
QueueRemove:
        xor     ecx, ecx
        mov     eax, [qPoll]
   .loop:
        ; advance to next item in EDX, wrap around at end
        lea     edx, [eax+4]
        cmp     eax, [qEnd]
        cmovz   edx, [qHead]

        cmp     dword [eax], 0
        je      .fail

   lock cmpxchg [qPoll], edx
        jnz     .loop

        xchg    eax, ecx
        xchg    [ecx], eax
        ret

   .fail:
        xchg    eax, ecx
        ret    
I've had two thread hang and I've run it less than a dozen times!

_________________
¯\(°_o)/¯ unlicense.org


Last edited by bitRAKE on 09 Oct 2008, 17:03; edited 1 time in total
Post 09 Oct 2008, 16:39
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
bitRAKE wrote:

QueueInit doesn't fail when it fails:


Damn, I don't know how I managed to think that non-zero means failure and success at the same time Laughing

About cmovz, I haven't used it because I raise the minimum CPU requirement to PPro with that but the 64-bit version should use it definitively.

Thanks bitRAKE
Post 09 Oct 2008, 17:03
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2887
Location: [RSP+8*5]
bitRAKE
Looks like you are accessing the thread handles with one based indexing instead of zero based.
Code:
invoke  WaitForSingleObject, [hThreads+ebx*4], -1    
When EBX is THREADS then the DWORD after the hThreads array is pushed. Also, the loop exits when EBX is zero - which skips the first handle in the array.
Code:
invoke  WaitForSingleObject, [hThreads+4*(ebx-1)], -1    
...fixes the error (FASM is so smart like that Very Happy ). Another option would be the single API call:
Code:
invoke  WaitForMultipleObjects, THREADS, hThreads, TRUE, -1    
...can't see a reason not to.

_________________
¯\(°_o)/¯ unlicense.org
Post 09 Oct 2008, 17:24
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
hehe right, but note that I have made the very same mistake in the creation loop so I'm waiting for real handlers anyway (I'm using from hThreads[1] to buff[0..3]).

About WaitForMultipleObjects, I avoided it by I don't remember what silly reason (maybe I was hurried to read the documentation to make sure I use it properly).

Anyway, the wait loop was added just to make sure the message box won't lie, but actually when the human eye is able to see the message box, all the other threads have executed ExitThread with a huge level of certainty. (The waiter is the very same thread that gets preempted at QueueAdd.putElement)
Post 09 Oct 2008, 17:46
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2887
Location: [RSP+8*5]
bitRAKE
Lock-free stack implementation from a discussion on Google Groups:
Code:
; Lock-Free Queue Push 
; extern void acLfQueuePush( ac_LfQueue* pQueue, ac_LfNode* pNode ); 
acLfQueuePush PROC 
  ; Preserve
  push edi
  push esi
  push ebx

  ; Load the back
  mov  esi, [esp + 16]
  mov  edx, [esi + 12]
  mov  edi, [esi + 8]

acLfQueuePush_Retry: 
  ; Prepare the Cas
  mov  eax, 0
  mov  ebx, [esp + 20]
  lea  ecx, [edx + 1]

  ; Try to link the new node
  cmpxchg [edi], ebx
  je  acLfQueuePush_Success

  ; Prepare the Cas
  mov  ebx, eax
  mov  eax, edi

  ; Try to advance the back to the next
  cmpxchg8b qword ptr [esi + 8]
  mov  edi, eax
  jmp  acLfQueuePush_Retry

acLfQueuePush_Success: 

  ; Prepare the Cas
  mov  eax, edi

  ; Try to point the back to the new node
  cmpxchg8b qword ptr [esi + 8]

  ; Restore
  pop  ebx
  pop  esi
  pop  edi

  ret

acLfQueuePush ENDP 


; Lock-Free Queue Pop 
; extern ac_LfNode* acLfQueuePop( ac_LfQueue* pQueue ); 
acLfQueuePop PROC 

  ; Preserve
  push edi
  push esi
  push ebx

  mov  esi, [esp + 16]

acLfQueuePop_Retry: 

  ; Load the front
  mov  edx, [esi + 4]
  mov  edi, [esi]

  ; Load the back
  mov  ecx, [esi + 12]
  mov  ebx, [esi + 8]

  ; Load the next
  mov  eax, [edi]

  ; Validate
  test eax, eax
  je  acLfQueuePop_Fail

  cmp  edi, [esi]
  jne  acLfQueuePop_Retry

  cmp  edx, [esi + 4]
  jne  acLfQueuePop_Retry

  cmp  edi, [esi + 8]
  jne  acLfQueuePop_TryPop

  ; Prepare the Cas
  mov  eax, edi
  mov  edx, ebx
  mov  ebx, eax
  lea  ecx, [edx + 1]

  ; Try to advance the back to the next
  cmpxchg8b qword ptr [esi + 8]
  jmp  acLfQueuePop_Retry

acLfQueuePop_TryPop: 

  ; Load the object
  mov  ecx, [eax + 12]

  ; Prepare the Cas
  mov  ebx, eax
  mov  eax, edi
  mov  edi, ecx
  lea  ecx, [edx + 1]

  ; Try to unlink the next node
  cmpxchg8b qword ptr [esi]
  jne  acLfQueuePop_Retry

  ; Set the object on the popped node
  mov  [eax + 12], edi

acLfQueuePop_Fail: 

  ; Restore
  pop  ebx
  pop  esi
  pop  edi
  ret

acLfQueuePop ENDP    

_________________
¯\(°_o)/¯ unlicense.org
Post 10 Oct 2008, 00:44
View user's profile Send private message Visit poster's website Reply with quote
nop



Joined: 01 Sep 2008
Posts: 165
Location: right here left there
nop
i feel so inadequate cos i have no idea what any of this stuff is for, or how it does it Crying or Very sad i'm just too simple i ges
Post 10 Oct 2008, 01:32
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2, 3  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar.

Powered by rwasa.