flat assembler
Message board for the users of flat assembler.

Index > Tutorials and Examples > Accurate multi-threaded 64-bit counters on a 32-bit machine

Goto page Previous  1, 2, 3  Next
Author
Thread Post new topic Reply to topic
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 26 Apr 2013, 15:20
bitRAKE wrote:
Using the notation from the manual with a slight modification:
Code:
READ A.1.1
READ A.1.0
        READ B.1.0
        READ C.1.0
        WRITE B.1.0
        WRITE C.1.0
        READ A.1.0
        WRITE A.1.0

READ A.2.0
READ A.2.1
        READ B.2.1
        READ C.2.1
        WRITE B.2.1
        WRITE C.2.1
        READ A.2.1
        WRITE A.2.1    
...where READ.A.2.1 means a read operation at location A, processor 2, and lock bit is 1. As long as processor ordering is preserved, what is the ordering (shuffling) which produces an error?
I don't know where, or if, there is any error. Maybe it will work perfectly on your machine. But your code is not lockless, right? You are using a bitmap to control access to the counters. Or have I misunderstood what you are doing?

If you have a lock then you don't need to worry about the ordering of reads/writes at the hardware level. That is why locks are used, so that the programmer does not need to be concerned about different systems breaking the code.
Post 26 Apr 2013, 15:20
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 26 Apr 2013, 15:37
Okay I looked further into your code. It looks semantically correct for the two thread scenario and I would expect it to function well. But you incorrectly describe it as lockless when it is in fact using locks for each counter. The locks are contained in the bitmap.

However there appears to be a large flaw with trying to use it on more than two threads/CPUs at one time. Your code appears to not be extendible to an arbitrary number of threads. You could of course alter it for more threads but then it devolves into the standard data-store-with-an-associated-external-lock scenario.
Post 26 Apr 2013, 15:37
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4061
Location: vpcmpistri
bitRAKE 26 Apr 2013, 21:16
It does work generally.

1) threads only signal one other thread
2) last write is signal update

This insures the sequential operation. Although the write order between processors is not guaranteed, the write order of a single processor is ordered. (Speaking about the main memory bus - multi-chip/core.) So, this describes a sequential fragmented stream of instructions - not a parallel process. It is a lock in that sense, but not a typical lock. It would be advantageous to have several such sequential streams running in parallel.

Not sending the bus lock is important.
Post 26 Apr 2013, 21:16
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 28 Apr 2013, 08:42
bitRAKE wrote:
It does work generally.

1) threads only signal one other thread
2) last write is signal update

This insures the sequential operation. Although the write order between processors is not guaranteed, the write order of a single processor is ordered. (Speaking about the main memory bus - multi-chip/core.) So, this describes a sequential fragmented stream of instructions - not a parallel process. It is a lock in that sense, but not a typical lock. It would be advantageous to have several such sequential streams running in parallel.

Not sending the bus lock is important.
I think you missed my point above. Bus lock is not what is making your code work, it is the software lock. But even so, there is a flaw when the threads are updating different counts:
Code:
;...
TEST_COUNT1 = 1000
TEST_COUNT2 = 9000
;...
          invoke  CreateThread,NULL,0,[address#i],TEST_COUNT#i,NULL,thread#i
;...    
The code locks up at 100% usage with one thread running out of available locks. The hardware bus lock might not be activated but it makes no difference when the code runs around in loops forever waiting in vain for a software lock to be released. Without the underlying assumption that the threads will count the same value then the code is not useful in any general setup.

Also I would be keen to see how you would make the code use arbitrary thread counts.
Post 28 Apr 2013, 08:42
View user's profile Send private message Visit poster's website Reply with quote
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 29 Apr 2013, 14:35
bitRAKE's solution is in all cases faster, granted locking is on all involved memory addresses.
here using 8 threads, different counts, and adder value on my Quad Yorkfield
Quote:
revolution
Counter:5254850 Time:203
Counter:8360000000000000 Time:4243

lock on both 32 addresses without JNC
Counter:3710357 Time:249
Counter:8360000000000000 Time:1732

lock on both 32 addresses using JNC
Counter:3816022 Time:187
Counter:8360000000000000 Time:1092

bitRAKE solution should be the choice for big numbers in mem. that 35 cycles branch overhead from JNC is worth more than
2 consecutive locks. having prefetch then on larger memory, it can improve 10%~15% on the best result here.
Cheers,
Very Happy

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 29 Apr 2013, 14:35
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 29 Apr 2013, 14:41
hopcode: Can you show your code please. I couldn't get bitRAKE's code to work with more then two threads or with independent counts on each thread.
Post 29 Apr 2013, 14:41
View user's profile Send private message Visit poster's website Reply with quote
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 29 Apr 2013, 14:52
Code:
;--- DATA 
NUMT equ 8
 rept NUMT i {
   thread#i dd ?
  }
  sig_start dd 0
  ;--- CODE 

  rept NUMT i {
   invoke  CreateThread,NULL,0,[address],TEST_COUNT+i*10000,NULL,thread#i
   mov  [thread#i],eax
 }
  mov  [sig_start],1
  invoke  WaitForMultipleObjects,NUMT,thread1,TRUE,-1
    


Code:
proc safe_thread count
.start:
 lock and [sig_start],1         ;--- avoid ResumeThread overhead
 jz .start
 mov esi,1000000000   ;--- possible adder value
.loop:
 lock add dword[counter+0],esi
 jnc    @f
 lock adc dword[counter+4],0  ;--- lock this only when needed
 ;--- @add64 [counter] ;--- revolution renamed macro
@@:
 dec  [count]
 jnz  .loop
 ret
endp
    

should work fine

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 29 Apr 2013, 14:52
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 29 Apr 2013, 15:02
hopcode: Your "safe_thread" is not safe at all. You can't use 32-bit code like that without some form of lock to prevent other threads reading the half updated values.
Post 29 Apr 2013, 15:02
View user's profile Send private message Visit poster's website Reply with quote
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 29 Apr 2013, 15:05
agree, because there is no signaling during carry propagation.

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 29 Apr 2013, 15:05
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 29 Apr 2013, 15:07
hopcode: Can you show your solution with bitRAKE's code for more than 2 threads and independent counters please? I am curious to see how you extended to get it to work.
Post 29 Apr 2013, 15:07
View user's profile Send private message Visit poster's website Reply with quote
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 29 Apr 2013, 15:23
i was working on your solution, because bitmaps "delays" the problem to another stage, and it may be too much contextual, where LOCK is already atomic on memory addresses.
anyway i doubt there is a better solution than the 2 locks+JNC as from here,
http://board.flatassembler.net/topic.php?p=155105#155105
it is not that bad i think, considering your macro does lock-read + modify + lock-write

edit:... a moment, perhaps i am missing something Smile

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 29 Apr 2013, 15:23
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 29 Apr 2013, 15:44
Oh, you were working from the older code. I thought you were using bitRAKE's later code.
Post 29 Apr 2013, 15:44
View user's profile Send private message Visit poster's website Reply with quote
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 29 Apr 2013, 16:14
no, because the BTS doesnt assert LOCK intrinsically, and for this reason the method should be avoided even when working on little bitmaps, just because LOCK is atomical and may invalidate the cacheline, while the opposite is false.
in fact trying prefetch to avoid locking on datas in the same lines that do not require update doesnt work. the correct is the second code from bitRAKE, but using 2 locks. now confirmed, using 8 different counter too: same timing as above (relatively)

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 29 Apr 2013, 16:14
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4061
Location: vpcmpistri
bitRAKE 29 Apr 2013, 18:49
Sorry, I haven't gotten back to this. I'm working on a general solution. First it will build a model of computation and then apply a concurrency model to execute it with arbitrary threads. So, we can better test specific types of processing (memory intensive, processor intensive, thread dependencies etc.); and specific models of concurrency.

We should make a distinction between the software lock, the lock instruction and the hardware lock# signal. The software lock allows the memory controller to do it's job uninterrupted, and that is why it's faster (even when there is memory contention between cores).

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Post 29 Apr 2013, 18:49
View user's profile Send private message Visit poster's website Reply with quote
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 30 Apr 2013, 05:36
agree with bitRAKE, needed general guidelines summarizing the matter. from several manuals.
example CHAPTER 8 MULTIPLE-PROCESSOR MANAGEMENT from
Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes:1, 2A, 2B, 2C, 3A, 3B, and 3C
( http://download.intel.com/products/processor/manual/325462.pdf ) explaining something but that is not all.

it's not simple to find an explanation why it works/does-not-work without testing it, and this should be the goal, imo Smile
we may find XCHG good because asserting LOCK on older processor, or 2 consecutive locked instructions working fine on newer ones
because of cache mechanisms, and considering the privilege level to low to modify in protected mode the way to access memory (WP/WC etc)
ehm, that job is hard...but worth
Cheers,
Very Happy

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 30 Apr 2013, 05:36
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 30 Apr 2013, 06:57
I've updated the 64-bit read macro to not use the bus lock to read the value. Instead I use the trick of reading one of the memory contents twice and comparing to see if it changed. Note that this only works correctly when the 64-bit writes are guaranteed atomic so the 'lock cmpxchg8b' instruction still needs to be used for writes to make this work.

Plus I've updated the test to allow arbitrary thread counts by setting an assembly time constant, and combined it all into a single file for testing.

Note that the function named 'unsafe_thread_2' can't be broken with this test code but it is still a broken implementation of updating a 64-bit value atomically. A different version of the test code could expose the flaw when both the low and high parts of the 64-bit value are required before the result is written back. A simple adder, as used here, will in fact work correctly and the code is faster than both versions of the safe_thread functions. For some circumstances, if the speed difference is important, then then 'unsafe_thread_2' could be suitable for use, but the use restrictions must be properly considered.
Code:
TEST_COUNT      = 10000000
ADDER_VALUE     = 10000000000
THREAD_COUNT    = 2

include 'win32ax.inc'

macro old_atomic_64_bit_read address {
        ;thread safe read 64-bit memory location. read edx:eax from memory
        ;ebx and ecx are corrupted
        mov     ebx,eax         ;in case of match place back the same value (non-destructive read)
        mov     ecx,edx         ;in case of match place back the same value (non-destructive read)
        lock cmpxchg8b [address]        ;value is returned in edx:eax (edx is high order 32 bits, eax is low order 32 bits)
}

macro old_atomic_64_bit_write address {
        ;thread safe write 64-bit memory location. write ecx:ebx to memory (ecx is high order 32 bits, ebx is low order 32 bits)
        ;edx and eax are corrupted
        local   .retry
    .retry:
        lock cmpxchg8b [address]        ;write ecx:ebx to memory if edx:eax matches current value
        jnz     .retry          ;if no match then try again with new value in edx:eax
}

macro old_atomic_64_bit_op address,op1,op2 {
        ;thread safe operation of edi:esi with the 64-bit memory location (edi is high order 32 bits, esi is low order 32 bits)
        ;eax, ebx, ecx and edx are corrupted
        ;firstly read the value in the memory location
        local   .retry
        old_atomic_64_bit_read address
    .retry:
        ;transfer to working registers ecx:ebx
        mov     ebx,eax
        mov     ecx,edx
        ;do the requested operation
        op1     ebx,esi
        op2     ecx,edi
        ;try to write it back
        lock cmpxchg8b [address]        ;write ecx:ebx to memory only if edx:eax matches current value (i.e. no one else has changed the value)
        jnz     .retry          ;if the memory value has changed then try again with new value in edx:eax
}

macro old_atomic_64_bit_add address { old_atomic_64_bit_op address,add,adc }
macro old_atomic_64_bit_sub address { old_atomic_64_bit_op address,sub,sbb }
macro old_atomic_64_bit_xor address { old_atomic_64_bit_op address,xor,xor }
macro old_atomic_64_bit_or  address { old_atomic_64_bit_op address, or, or }
macro old_atomic_64_bit_and address { old_atomic_64_bit_op address,and,and }
macro old_atomic_64_bit_inc address {
        xor     edi,edi
        mov     esi,1
        old_atomic_64_bit_add address
}
macro old_atomic_64_bit_not address {
        or      edi,-1
        or      esi,-1
        old_atomic_64_bit_xor address
}

;newer macros below

macro atomic_64_bit_read address {
        ;thread safe read 64-bit memory location. read edx:eax from memory
        ;ecx is corrupted
        local   .again
    .again:
        mov     ecx,dword[address+0]
        mov     edx,dword[address+4]
        mov     eax,dword[address+0]
        cmp     ecx,eax
        jnz     .again
}

macro atomic_64_bit_write address {
        ;thread safe write 64-bit memory location. write ecx:ebx to memory (ecx is high order 32 bits, ebx is low order 32 bits)
        ;edx and eax are corrupted
        local   .retry
    .retry:
        lock cmpxchg8b [address]        ;write ecx:ebx to memory if edx:eax matches current value
        jnz     .retry          ;if no match then try again with new value in edx:eax
}

macro atomic_64_bit_op address,op1,op2 {
        ;thread safe operation of edi:esi with the 64-bit memory location (edi is high order 32 bits, esi is low order 32 bits)
        ;eax, ebx, ecx and edx are corrupted
        ;firstly read the value in the memory location
        local   .retry
        atomic_64_bit_read address
    .retry:
        ;transfer to working registers ecx:ebx
        mov     ebx,eax
        mov     ecx,edx
        ;do the requested operation
        op1     ebx,esi
        op2     ecx,edi
        ;try to write it back
        lock cmpxchg8b [address]        ;write ecx:ebx to memory only if edx:eax matches current value (i.e. no one else has changed the value)
        jnz     .retry                  ;if the memory value has changed then try again with new value in edx:eax
}

macro atomic_64_bit_add address { atomic_64_bit_op address,add,adc }
macro atomic_64_bit_sub address { atomic_64_bit_op address,sub,sbb }
macro atomic_64_bit_xor address { atomic_64_bit_op address,xor,xor }
macro atomic_64_bit_or  address { atomic_64_bit_op address, or, or }
macro atomic_64_bit_and address { atomic_64_bit_op address,and,and }
macro atomic_64_bit_inc address {
        xor     edi,edi
        mov     esi,1
        atomic_64_bit_add address
}
macro atomic_64_bit_not address {
        or      edi,-1
        or      esi,-1
        atomic_64_bit_xor address
}

.data

        counter dq ?
        threads rd THREAD_COUNT
        message rb 1 shl 12

.code

start:
        stdcall run_test,unsafe_thread,'unsafe'
        stdcall run_test,unsafe_thread_2,'unsafe2'
        stdcall run_test,old_safe_thread,'old safe'
        stdcall run_test,safe_thread,'safe'
        invoke  MessageBox,0,message,'Safe or not?',0
        invoke  ExitProcess,0

proc run_test uses esi edi ebx,address,name
        xor     ecx,ecx
        xor     ebx,ebx
        atomic_64_bit_write counter
        invoke  GetTickCount
        mov     edi,eax
        mov     esi,threads
    .thread_loop:
        invoke  CreateThread,NULL,0,[address],TEST_COUNT,NULL,esi
        mov     [esi],eax
        add     esi,4
        cmp     esi,threads+THREAD_COUNT*4
        jnz     .thread_loop
        invoke  WaitForMultipleObjects,THREAD_COUNT,threads,TRUE,-1
        invoke  GetTickCount
        sub     edi,eax
        neg     edi
        invoke  lstrlen,message
        invoke  wsprintf,addr message+eax,<'%s',9,'Count:%I64u Time:%u',13,10>,[name],double [counter],edi
        ret
endp

proc unsafe_thread count
        mov     esi,ADDER_VALUE and 0xffffffff
        mov     edi,ADDER_VALUE shr 32
    .loop:
        mov     eax,dword[counter+0]
        mov     edx,dword[counter+4]
        add     eax,esi
        adc     edx,edi
        mov     dword[counter+0],eax
        mov     dword[counter+4],edx
        dec     [count]
        jnz     .loop
        ret
endp

proc unsafe_thread_2 count
        mov     esi,ADDER_VALUE and 0xffffffff
        mov     edi,ADDER_VALUE shr 32
    .loop:
        lock
        add     dword[counter+0],esi
        lock
        adc     dword[counter+4],edi
        dec     [count]
        jnz     .loop
        ret
endp

proc old_safe_thread count
        mov     esi,ADDER_VALUE and 0xffffffff
        mov     edi,ADDER_VALUE shr 32
    .loop:
        old_atomic_64_bit_add counter
        dec     [count]
        jnz     .loop
        ret
endp

proc safe_thread count
        mov     esi,ADDER_VALUE and 0xffffffff
        mov     edi,ADDER_VALUE shr 32
    .loop:
        atomic_64_bit_add counter
        dec     [count]
        jnz     .loop
        ret
endp

.end start    
Code:
unsafe  Count:132691383273074688 Time:374
unsafe2 Count:200000000000000000 Time:811
oldsafe Count:200000000000000000 Time:1857
safe    Count:200000000000000000 Time:1606    
Post 30 Apr 2013, 06:57
View user's profile Send private message Visit poster's website Reply with quote
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 30 Apr 2013, 08:47
Code:
    .again:
        mov     ecx,dword[address+0]
        mov     edx,dword[address+4]
        mov     eax,dword[address+0]
        cmp     ecx,eax
        jnz     .again
    

this may have flaws too, for the same reason i agreed with you above about carry propagation on hi address. and you have here
the latency of 2 more not-locking-MOV (danger!! Wink ) and a JNZ.
the method using 2-LOCK and JNC has the disadvantage to assume the hi register is correct. this has lead me to think that we are only trying
good programming design, on the other side neglecting preventing an eventual bad modification/access to a resource (the 64bit counter ).
the advantages are but worth to be considered.

in my further tests, i run 2 threads safe+unsafe at the same time, and none of the solutions above obviously works on the counter.
you may say it is a matter of further design in the safe proc, ok. but the unsafe thread may be an external user's code allowed to access
the counter. and this is a frequent real-case. also i wonder whether there is an abstract method to share a resource and check
for its safety during access/modification from user-end-thread with very few code.

i'll try later today the updates,
Cheers,
Very Happy

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 30 Apr 2013, 08:47
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 30 Apr 2013, 09:26
hopcode wrote:
Code:
    .again:
        mov     ecx,dword[address+0]
        mov     edx,dword[address+4]
        mov     eax,dword[address+0]
        cmp     ecx,eax
        jnz     .again
    

this may have flaws too, for the same reason i agreed with you above about carry propagation on hi address. and you have here
the latency of 2 more not-locking-MOV (danger!! Wink ) and a JNZ.
the method using 2-LOCK and JNC has the disadvantage to assume the hi register is correct. this has lead me to think that we are only trying
good programming design, on the other side neglecting preventing an eventual bad modification/access to a resource (the 64bit counter ).
the advantages are but worth to be considered.
Thanks for your comment. Although I did mention above that this code is accurate as long as the write function is atomic. The two parts must go together. An atomic write and the above read code form a coherent and accurate 64-bit update system.
Post 30 Apr 2013, 09:26
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 30 Apr 2013, 09:58
Perhaps some explanation is on order. I'll try my best to show what I mean:

The reader code is always subject to reading an older (out of date) value for the counter. This is true whether or not we use a hardware atomic 64-bit read instruction or if we use split 32-bit reads. We deal with the case of an old value later when the counter is written by use of the lock cmpxchg8b instruction.

When reading it is not necessary to check the high order value for changes because we only read it once and it will either be a current value or an older value. Either way the low order value will show any change that is not a multiple of 2^32 and the high order value will be either the current value or an older value if the change is a multiple of 2^32. Once again, if we have read an older value then we deal with that when the counter is written again by use of the lock cmpxchg8b instruction.
Post 30 Apr 2013, 09:58
View user's profile Send private message Visit poster's website Reply with quote
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 30 Apr 2013, 13:50
Quote:
unsafe Count:146134917961604096 Time:109
unsafe2 Count:200000000000000000 Time:1092
oldsafe Count:200000000000000000 Time:1935
safe Count:200000000000000000 Time:1529

the trick is good and works faster. but i think your older proc, 2 x cmpxchg8, is the safest one. the penalty/advantage using cmpxchg8 is because it is atomical on all 64bit adresses.
perhaps internally it executes more than 2-locks (on read and write, as from the manual); for this reason i find it not so performant as it should be.
anyway having a huge vector to update, PI calculation for example, the JNC beetweeen 2 lock-adds may improve speed logarithmic.
Cheers,
Very Happy

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 30 Apr 2013, 13:50
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.