flat assembler
Message board for the users of flat assembler.

Index > Tutorials and Examples > Accurate multi-threaded 64-bit counters on a 32-bit machine

Goto page 1, 2, 3  Next
Author
Thread Post new topic Reply to topic
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 07 Mar 2013, 12:34
The most obvious way to solve the problem of atomic 64-bit updates to memory values from multiple CPUs is to use the OS synchronisation functions. They will work, but they can be very expensive.

A much more efficient method is to use cmpxchg8b (available on Pentium and later CPUs).

So without much further ado, here is an example:

file: atomic64.inc
Code:
macro atomic_64_bit_read address {
        ;thread safe read 64-bit memory location. read edx:eax from memory
        ;ebx and ecx are corrupted
        mov     ebx,eax         ;in case of match place back the same value (non-destructive read)
        mov     ecx,edx         ;in case of match place back the same value (non-destructive read)
        lock cmpxchg8b address  ;value is returned in edx:eax (edx is high order 32 bits, eax is low order 32 bits)
}

macro atomic_64_bit_write address {
        ;thread safe write 64-bit memory location. write ecx:ebx to memory (ecx is high order 32 bits, ebx is low order 32 bits)
        ;edx and eax are corrupted
        local   .retry
    .retry:
        lock cmpxchg8b address  ;write ecx:ebx to memory if edx:eax matches current value
        jnz     .retry          ;if no match then try again with new value in edx:eax
}

macro atomic_64_bit_op address,op1,op2 {
        ;thread safe operation of edi:esi with the 64-bit memory location (edi is high order 32 bits, esi is low order 32 bits)
        ;eax, ebx, ecx and edx are corrupted
        ;firstly read the value in the memory location
        local   .retry
        atomic_64_bit_read address
    .retry:
        ;transfer to working registers ecx:ebx
        mov     ebx,eax
        mov     ecx,edx
        ;do the requested operation
        op1     ebx,esi
        op2     ecx,edi
        ;try to write it back
        lock cmpxchg8b address  ;write ecx:ebx to memory only if edx:eax matches current value (i.e. no one else has changed the value)
        jnz     .retry          ;if the memory value has changed then try again with new value in edx:eax
}

macro atomic_64_bit_add address { atomic_64_bit_op address,add,adc }
macro atomic_64_bit_sub address { atomic_64_bit_op address,sub,sbb }
macro atomic_64_bit_xor address { atomic_64_bit_op address,xor,xor }
macro atomic_64_bit_or  address { atomic_64_bit_op address, or, or }
macro atomic_64_bit_and address { atomic_64_bit_op address,and,and }
macro atomic_64_bit_inc address {
        xor     edi,edi
        mov     esi,1
        atomic_64_bit_add address
}
macro atomic_64_bit_not address {
        or      edi,-1
        or      esi,-1
        atomic_64_bit_xor address
}    
And a little windows test code to see if it works:
Code:
include 'win32ax.inc'
include 'atomic64.inc'

TEST_COUNT = 1000000

.data

        counter dq ?
        thread1 dd ?
        thread2 dd ?
        message rb 1024

.code

start:
        stdcall run_test,unsafe_thread
        stdcall run_test,safe_thread
        invoke  MessageBox,0,message,'Safe or not?',0
        invoke  ExitProcess,0

proc run_test uses edi ebx,address
        xor     ecx,ecx
        xor     ebx,ebx
        atomic_64_bit_write [counter]
        invoke  GetTickCount
        mov     edi,eax
        invoke  CreateThread,NULL,0,[address],TEST_COUNT,NULL,thread1
        mov     [thread1],eax
        invoke  CreateThread,NULL,0,[address],TEST_COUNT,NULL,thread2
        mov     [thread2],eax
        invoke  WaitForMultipleObjects,2,thread1,TRUE,-1
        invoke  GetTickCount
        sub     edi,eax
        neg     edi
        invoke  lstrlen,message
        invoke  wsprintf,addr message+eax,<'Count:%I64u Time:%u',13,10>,double [counter],edi
        ret
endp

proc safe_thread count
    .loop:
        atomic_64_bit_inc [counter]
        dec     [count]
        jnz     .loop
        ret
endp

proc unsafe_thread count
    .loop:
        mov     eax,dword[counter+0]
        mov     edx,dword[counter+4]
        add     eax,1
        adc     edx,0
        mov     dword[counter+0],eax
        mov     dword[counter+4],edx
        dec     [count]
        jnz     .loop
        ret
endp

.end start    
Naturally there is a price to pay to get the correct result:
Code:
Count:1150857 Time:47
Count:2000000 Time:187    
The second count is the correct value 2000000. The first count is from the unsafe un-synchronised thread.


Last edited by revolution on 07 Mar 2013, 14:59; edited 2 times in total
Post 07 Mar 2013, 12:34
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 07 Mar 2013, 14:53
Code:
proc unsafe_thread count
        mov eax,1
    .loop:
        lock
        add dword [counter+0],eax
        jnc @F
        add dword [counter+4],eax
@@:
        dec     [count]
        jnz     .loop
        ret
endp    
...I wasn't able to break it with 16 threads.
Doesn't mean anything though.


Last edited by bitRAKE on 07 Mar 2013, 15:05; edited 1 time in total
Post 07 Mar 2013, 14:53
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 07 Mar 2013, 15:02
bitRAKE wrote:
Code:
proc unsafe_thread count
    .loop:
        mov eax,1
        lock
        xadd dword [counter+0],eax
        jnc @F
        add dword [counter+4],1
@@:
        dec     [count]
        jnz     .loop
        ret    
...I wasn't able to break it with 16 threads.
Doesn't mean anything though.
How are you protecting the high order read? Try your test with an adder value of 1000000000 instead of 1.
Post 07 Mar 2013, 15:02
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 07 Mar 2013, 15:04
using xadd fails:
Code:
Count:1982888850292736 Time:78
Count:2000000000000000 Time:187    


Edit: Test code fragment:
Code:
ADDER_VALUE     = 1000000000

proc safe_thread count
        mov     esi,ADDER_VALUE and 0xffffffff
        mov     edi,ADDER_VALUE shr 32
    .loop:
        atomic_64_bit_add [counter]
        dec     [count]
        jnz     .loop
        ret
endp

proc unsafe_thread_bitRAKE count
    .loop:
        mov     eax,ADDER_VALUE
        lock
        xadd    dword[counter+0],eax
        jnc     .skip
        add     dword[counter+4],1
    .skip:
        dec     [count]
        jnz     .loop
        ret
endp    


Last edited by revolution on 07 Mar 2013, 15:09; edited 1 time in total
Post 07 Mar 2013, 15:04
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 07 Mar 2013, 15:07
No, I only post INC thread. Smile
(You change it to make it fail.)
Post 07 Mar 2013, 15:07
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 07 Mar 2013, 15:17
bitRAKE wrote:
(You change it to make it fail.)
No, I only showed that for a process that runs long enough you can still get a collision in the high order part. So that proc is still unsafe and not reliable. Indeed I can't see how any sequence of 32-bit operations can be made truly safe. You might be lucky and manage to avoid any failures during testing, but that would be asking for future weird bugs to frustrate you.
Post 07 Mar 2013, 15:17
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 07 Mar 2013, 15:20
It's impossible to design against any failures. Each problem should be assessed for the level of tolerance desired in that situation. We have not designed a dependent thread which gets held up, and we lack the processing power to overflow a 32-bit number in sufficient time. (with INC)
Post 07 Mar 2013, 15:20
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 07 Mar 2013, 15:28
I'm designing against program failures. The code is functionally correct. We can't ask for more than that from code. If the surrounding CPU system has hardware problems then no amount of code can hope to correct for such things. Intel/AMD guarantee the functional correctness of "lock cmpxchg8b" by design. But they can't vouch for someone's flaky PSU altering the memory read values.
Post 07 Mar 2013, 15:28
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 07 Mar 2013, 15:49
The lock cmpxchg8b method is considerably faster than using xchg:
Code:
_counter dd counter

proc safe_thread count

.loop:
        xor ecx,ecx
@@:     xchg ecx,[_counter]
        jecxz @B

        add dword [ecx+0],1
        adc dword [ecx+4],0

        mov [_counter],ecx

        dec     [count]
        jnz     .loop
        ret
endp
; cmpx: 4302467295 Time:842
; xchg: 4302467295 Time:2090    
Post 07 Mar 2013, 15:49
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 08 Mar 2013, 00:18
bitRAKE wrote:
The lock cmpxchg8b method is considerably faster than using xchg:
As we would hope it should be. Using a standard locking mechanism like your code above is the usual way people have solved for this type of situation. But they are expensive and require an extra global variable for each protected 64-bit value.
Post 08 Mar 2013, 00:18
View user's profile Send private message Visit poster's website Reply with quote
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 31 Mar 2013, 02:58
Very fast and simple atomic increment:

Code:
mov eax, 1
lock xadd [mem], eax
    


Adds one to [mem] and returns previous value stored in [mem].
Post 31 Mar 2013, 02:58
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 31 Mar 2013, 03:03
randall wrote:
Very fast and simple atomic increment:

Code:
mov eax, 1
lock xadd [mem], eax
    


Adds one to [mem] and returns previous value stored in [mem].
Yes, for 32-bit counters it is fine.
Post 31 Mar 2013, 03:03
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 26 Apr 2013, 12:55
This lockless method is consistently faster on my system. Not as complex as it looks - bit lock on pieces and partitioned the counter into 16 pieces. Each thread operates on the opposite bit state. This is done so that each thread isn't just counting by itself - that would be too easy. So, there is actual forced contention for resources.

Edit: I should also say that it's guaranteed to work because of memory ordering rules on the bus.


Description: Lockless as in no LOCK instructions needed.
(Or XCHG instruction.)

Download
Filename: lockless.zip
Filesize: 2.83 KB
Downloaded: 773 Time(s)


_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Post 26 Apr 2013, 12:55
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 26 Apr 2013, 13:16
bitRAKE wrote:
This lockless method is consistently faster on my system. Not as complex as it looks - bit lock on pieces and partitioned the counter into 16 pieces. Each thread operates on the opposite bit state. This is done so that each thread isn't just counting by itself - that would be too easy. So, there is actual forced contention for resources.
Okay, but I think it is not strictly lockless because you have a bit variable acting as a lock protecting the counter. Unless I missed something?

Aside: I usually find that unless a new method is provably correct (or at least equal to the previous method it replaces) then the small increase in speed we might temporarily get is easily more than outweighed by the enormous loss of productivity we encounter when the new method produces incorrect results.
Post 26 Apr 2013, 13:16
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 26 Apr 2013, 13:22
Sorry, I made an edit. It is guaranteed to work. Vol3, Ch8 in the manual explains the memory ordering rules. It's lockless in that a LOCK# signal is not sent across the bus. Which can have a huge impact in performance. It's not a small gain - it can be over twice faster.
Post 26 Apr 2013, 13:22
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 26 Apr 2013, 13:24
Oh, I just saw this.
bitRAKE wrote:
Edit: I should also say that it's guaranteed to work because of memory ordering rules on the bus.
Wouldn't this make your code hardware dependant? I'm not particularly confident that a 386 CPU (with a 32-bit bus) will follow the rules you appear to be implying.
Post 26 Apr 2013, 13:24
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 26 Apr 2013, 13:27
bitRAKE wrote:
Sorry, I made an edit. It is guaranteed to work. Vol3, Ch8 in the manual explains the memory ordering rules. It's lockless in that a LOCK# signal is not sent across the bus. Which can have a huge impact in performance. It's not a small gain - it can be over twice faster.
Is this also true for multi-socket mobos? What about NUMA systems?

I don't care about the performance gain if it can only be guaranteed to work on a subset of systems.
Post 26 Apr 2013, 13:27
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 26 Apr 2013, 13:45
Unless Intel is lying in the manual then it should work in general (Pentium+). I've been using Dual socket mobos since my overclocked BP6.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Post 26 Apr 2013, 13:45
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 26 Apr 2013, 14:14
bitRAKE wrote:
Unless Intel is lying in the manual then it should work in general (Pentium+). I've been using Dual socket mobos since my overclocked BP6.
I think there is a problem then. Not all mobos are made the same with regard to multi-CPU sockets.

Anyhow, I also think you misread the manual with regard to bus ordering. I can't find anywhere that is stated that separated reads and writes are guaranteed not to be interleaved between threads. Also, the split 32-bit reads and writes you use have no guarantee that another thread or CPU won't take the bus in between operations. Aligning the data only guarantees the basic operation is atomic. There is no guarantee that multiple operations of smaller size are implicitly locked between accesses. The fact that you couldn't fault it with your test does not mean that it is working the way that you think it is.
Post 26 Apr 2013, 14:14
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 26 Apr 2013, 15:07
Using the notation from the manual with a slight modification:
Code:
READ A.1.1
READ A.1.0
        READ B.1.0
        READ C.1.0
        WRITE B.1.0
        WRITE C.1.0
        READ A.1.0
        WRITE A.1.0

READ A.2.0
READ A.2.1
        READ B.2.1
        READ C.2.1
        WRITE B.2.1
        WRITE C.2.1
        READ A.2.1
        WRITE A.2.1    
...where READ.A.2.1 means a read operation at location A, processor 2, and lock bit is 1 (from that processors perspective, of course). As long as processor ordering is preserved, what is the ordering (shuffling) which produces an error?


Last edited by bitRAKE on 26 Apr 2013, 15:21; edited 1 time in total
Post 26 Apr 2013, 15:07
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2, 3  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.