flat assembler
Message board for the users of flat assembler.

flat assembler > Examples and Tutorials > Accurate multi-threaded 64-bit counters on a 32-bit machine

Goto page 1, 2, 3  Next
Author
Thread Post new topic Reply to topic
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
The most obvious way to solve the problem of atomic 64-bit updates to memory values from multiple CPUs is to use the OS synchronisation functions. They will work, but they can be very expensive.

A much more efficient method is to use cmpxchg8b (available on Pentium and later CPUs).

So without much further ado, here is an example:

file: atomic64.inc
Code:
macro atomic_64_bit_read address 
        ;thread safe read 64-bit memory location. read edxeax from memory
        ;ebx and ecx are corrupted
        mov     ebx,eax         ;in case of match place back the same value non-destructive read
        mov     ecx,edx         ;in case of match place back the same value non-destructive read
        lock cmpxchg8b address  ;value is returned in edxeax edx is high order 32 bits, eax is low order 32 bits


macro atomic_64_bit_write address 
        ;thread safe write 64-bit memory location. write ecxebx to memory ecx is high order 32 bits, ebx is low order 32 bits
        ;edx and eax are corrupted
        local   .retry
    .retry
        lock cmpxchg8b address  ;write ecxebx to memory if edxeax matches current value
        jnz     .retry          ;if no match then try again with new value in edxeax


macro atomic_64_bit_op address,op1,op2 
        ;thread safe operation of ediesi with the 64-bit memory location edi is high order 32 bits, esi is low order 32 bits
        ;eax, ebx, ecx and edx are corrupted
        ;firstly read the value in the memory location
        local   .retry
        atomic_64_bit_read address
    .retry
        ;transfer to working registers ecxebx
        mov     ebx,eax
        mov     ecx,edx
        ;do the requested operation
        op1     ebx,esi
        op2     ecx,edi
        ;try to write it back
        lock cmpxchg8b address  ;write ecxebx to memory only if edxeax matches current value i.e. no one else has changed the value
        jnz     .retry          ;if the memory value has changed then try again with new value in edxeax


macro atomic_64_bit_add address  atomic_64_bit_op address,add,adc 
macro atomic_64_bit_sub address  atomic_64_bit_op address,sub,sbb 
macro atomic_64_bit_xor address  atomic_64_bit_op address,xor,xor 
macro atomic_64_bit_or  address  atomic_64_bit_op address, or, or 
macro atomic_64_bit_and address  atomic_64_bit_op address,and,and 
macro atomic_64_bit_inc address 
        xor     edi,edi
        mov     esi,1
        atomic_64_bit_add address

macro atomic_64_bit_not address 
        or      edi,-1
        or      esi,-1
        atomic_64_bit_xor address
    
And a little windows test code to see if it works:
Code:
include 'win32ax.inc'
include 'atomic64.inc'

TEST_COUNT = 1000000

.data

        counter dq ?
        thread1 dd ?
        thread2 dd ?
        message rb 1024

.code

start
        stdcall run_test,unsafe_thread
        stdcall run_test,safe_thread
        invoke  MessageBox,0,message,'Safe or not?',0
        invoke  ExitProcess,0

proc run_test uses edi ebx,address
        xor     ecx,ecx
        xor     ebx,ebx
        atomic_64_bit_write counter
        invoke  GetTickCount
        mov     edi,eax
        invoke  CreateThread,NULL,0,address,TEST_COUNT,NULL,thread1
        mov     thread1,eax
        invoke  CreateThread,NULL,0,address,TEST_COUNT,NULL,thread2
        mov     thread2,eax
        invoke  WaitForMultipleObjects,2,thread1,TRUE,-1
        invoke  GetTickCount
        sub     edi,eax
        neg     edi
        invoke  lstrlen,message
        invoke  wsprintf,addr message+eax,<'Count%I64u Time%u',13,10>,double counter,edi
        ret
endp

proc safe_thread count
    .loop
        atomic_64_bit_inc counter
        dec     count
        jnz     .loop
        ret
endp

proc unsafe_thread count
    .loop
        mov     eax,dwordcounter+0
        mov     edx,dwordcounter+4
        add     eax,1
        adc     edx,0
        mov     dwordcounter+0,eax
        mov     dwordcounter+4,edx
        dec     count
        jnz     .loop
        ret
endp

.end start    
Naturally there is a price to pay to get the correct result:
Code:
Count1150857 Time47
Count2000000 Time187    
The second count is the correct value 2000000. The first count is from the unsafe un-synchronised thread.


Last edited by revolution on 07 Mar 2013, 14:59; edited 2 times in total
Post 07 Mar 2013, 12:34
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2792
Location: dank orb
Code:
proc unsafe_thread count
        mov eax,1
    .loop
        lock
        add dword counter+0,eax
        jnc @F
        add dword counter+4,eax
@@
        dec     count
        jnz     .loop
        ret
endp    
...I wasn't able to break it with 16 threads.
Doesn't mean anything though.


Last edited by bitRAKE on 07 Mar 2013, 15:05; edited 1 time in total
Post 07 Mar 2013, 14:53
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
bitRAKE wrote:
Code:
proc unsafe_thread count
    .loop
        mov eax,1
        lock
        xadd dword counter+0,eax
        jnc @F
        add dword counter+4,1
@@
        dec     count
        jnz     .loop
        ret    
...I wasn't able to break it with 16 threads.
Doesn't mean anything though.
How are you protecting the high order read? Try your test with an adder value of 1000000000 instead of 1.
Post 07 Mar 2013, 15:02
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
using xadd fails:
Code:
Count1982888850292736 Time78
Count2000000000000000 Time187    


Edit: Test code fragment:
Code:
ADDER_VALUE     = 1000000000

proc safe_thread count
        mov     esi,ADDER_VALUE and 0xffffffff
        mov     edi,ADDER_VALUE shr 32
    .loop
        atomic_64_bit_add counter
        dec     count
        jnz     .loop
        ret
endp

proc unsafe_thread_bitRAKE count
    .loop
        mov     eax,ADDER_VALUE
        lock
        xadd    dwordcounter+0,eax
        jnc     .skip
        add     dwordcounter+4,1
    .skip
        dec     count
        jnz     .loop
        ret
endp    


Last edited by revolution on 07 Mar 2013, 15:09; edited 1 time in total
Post 07 Mar 2013, 15:04
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2792
Location: dank orb
No, I only post INC thread. Smile
(You change it to make it fail.)
Post 07 Mar 2013, 15:07
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
bitRAKE wrote:
(You change it to make it fail.)
No, I only showed that for a process that runs long enough you can still get a collision in the high order part. So that proc is still unsafe and not reliable. Indeed I can't see how any sequence of 32-bit operations can be made truly safe. You might be lucky and manage to avoid any failures during testing, but that would be asking for future weird bugs to frustrate you.
Post 07 Mar 2013, 15:17
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2792
Location: dank orb
It's impossible to design against any failures. Each problem should be assessed for the level of tolerance desired in that situation. We have not designed a dependent thread which gets held up, and we lack the processing power to overflow a 32-bit number in sufficient time. (with INC)
Post 07 Mar 2013, 15:20
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
I'm designing against program failures. The code is functionally correct. We can't ask for more than that from code. If the surrounding CPU system has hardware problems then no amount of code can hope to correct for such things. Intel/AMD guarantee the functional correctness of "lock cmpxchg8b" by design. But they can't vouch for someone's flaky PSU altering the memory read values.
Post 07 Mar 2013, 15:28
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2792
Location: dank orb
The lock cmpxchg8b method is considerably faster than using xchg:
Code:
_counter dd counter

proc safe_thread count

.loop
        xor ecx,ecx
@@     xchg ecx,_counter
        jecxz @B

        add dword ecx+0,1
        adc dword ecx+4,0

        mov _counter,ecx

        dec     count
        jnz     .loop
        ret
endp
; cmpx 4302467295 Time842
; xchg 4302467295 Time2090    
Post 07 Mar 2013, 15:49
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
bitRAKE wrote:
The lock cmpxchg8b method is considerably faster than using xchg:
As we would hope it should be. Using a standard locking mechanism like your code above is the usual way people have solved for this type of situation. But they are expensive and require an extra global variable for each protected 64-bit value.
Post 08 Mar 2013, 00:18
View user's profile Send private message Visit poster's website Reply with quote
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
Very fast and simple atomic increment:

Code:
mov eax, 1
lock xadd mem, eax
    


Adds one to [mem] and returns previous value stored in [mem].
Post 31 Mar 2013, 02:58
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
randall wrote:
Very fast and simple atomic increment:

Code:
mov eax, 1
lock xadd mem, eax
    


Adds one to [mem] and returns previous value stored in [mem].
Yes, for 32-bit counters it is fine.
Post 31 Mar 2013, 03:03
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2792
Location: dank orb
This lockless method is consistently faster on my system. Not as complex as it looks - bit lock on pieces and partitioned the counter into 16 pieces. Each thread operates on the opposite bit state. This is done so that each thread isn't just counting by itself - that would be too easy. So, there is actual forced contention for resources.

Edit: I should also say that it's guaranteed to work because of memory ordering rules on the bus.


Description: Lockless as in no LOCK instructions needed.
(Or XCHG instruction.)

Download
Filename: lockless.zip
Filesize: 2.83 KB
Downloaded: 136 Time(s)


_________________
¯\(°_o)/¯ unlicense.org
Post 26 Apr 2013, 12:55
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
bitRAKE wrote:
This lockless method is consistently faster on my system. Not as complex as it looks - bit lock on pieces and partitioned the counter into 16 pieces. Each thread operates on the opposite bit state. This is done so that each thread isn't just counting by itself - that would be too easy. So, there is actual forced contention for resources.
Okay, but I think it is not strictly lockless because you have a bit variable acting as a lock protecting the counter. Unless I missed something?

Aside: I usually find that unless a new method is provably correct (or at least equal to the previous method it replaces) then the small increase in speed we might temporarily get is easily more than outweighed by the enormous loss of productivity we encounter when the new method produces incorrect results.
Post 26 Apr 2013, 13:16
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2792
Location: dank orb
Sorry, I made an edit. It is guaranteed to work. Vol3, Ch8 in the manual explains the memory ordering rules. It's lockless in that a LOCK# signal is not sent across the bus. Which can have a huge impact in performance. It's not a small gain - it can be over twice faster.
Post 26 Apr 2013, 13:22
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
Oh, I just saw this.
bitRAKE wrote:
Edit: I should also say that it's guaranteed to work because of memory ordering rules on the bus.
Wouldn't this make your code hardware dependant? I'm not particularly confident that a 386 CPU (with a 32-bit bus) will follow the rules you appear to be implying.
Post 26 Apr 2013, 13:24
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
bitRAKE wrote:
Sorry, I made an edit. It is guaranteed to work. Vol3, Ch8 in the manual explains the memory ordering rules. It's lockless in that a LOCK# signal is not sent across the bus. Which can have a huge impact in performance. It's not a small gain - it can be over twice faster.
Is this also true for multi-socket mobos? What about NUMA systems?

I don't care about the performance gain if it can only be guaranteed to work on a subset of systems.
Post 26 Apr 2013, 13:27
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2792
Location: dank orb
Unless Intel is lying in the manual then it should work in general (Pentium+). I've been using Dual socket mobos since my overclocked BP6.

_________________
¯\(°_o)/¯ unlicense.org
Post 26 Apr 2013, 13:45
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16702
Location: In your JS exploiting you and your system
bitRAKE wrote:
Unless Intel is lying in the manual then it should work in general (Pentium+). I've been using Dual socket mobos since my overclocked BP6.
I think there is a problem then. Not all mobos are made the same with regard to multi-CPU sockets.

Anyhow, I also think you misread the manual with regard to bus ordering. I can't find anywhere that is stated that separated reads and writes are guaranteed not to be interleaved between threads. Also, the split 32-bit reads and writes you use have no guarantee that another thread or CPU won't take the bus in between operations. Aligning the data only guarantees the basic operation is atomic. There is no guarantee that multiple operations of smaller size are implicitly locked between accesses. The fact that you couldn't fault it with your test does not mean that it is working the way that you think it is.
Post 26 Apr 2013, 14:14
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2792
Location: dank orb
Using the notation from the manual with a slight modification:
Code:
READ A.1.1
READ A.1.0
        READ B.1.0
        READ C.1.0
        WRITE B.1.0
        WRITE C.1.0
        READ A.1.0
        WRITE A.1.0

READ A.2.0
READ A.2.1
        READ B.2.1
        READ C.2.1
        WRITE B.2.1
        WRITE C.2.1
        READ A.2.1
        WRITE A.2.1    
...where READ.A.2.1 means a read operation at location A, processor 2, and lock bit is 1 (from that processors perspective, of course). As long as processor ordering is preserved, what is the ordering (shuffling) which produces an error?


Last edited by bitRAKE on 26 Apr 2013, 15:21; edited 1 time in total
Post 26 Apr 2013, 15:07
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2, 3  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2019, Tomasz Grysztar.

Powered by rwasa.