flat assembler
Message board for the users of flat assembler.
Index
> Tutorials and Examples > Accurate multi-threaded 64-bit counters on a 32-bit machine Goto page Previous 1, 2, 3 Next |
Author |
|
revolution 26 Apr 2013, 15:37
Okay I looked further into your code. It looks semantically correct for the two thread scenario and I would expect it to function well. But you incorrectly describe it as lockless when it is in fact using locks for each counter. The locks are contained in the bitmap.
However there appears to be a large flaw with trying to use it on more than two threads/CPUs at one time. Your code appears to not be extendible to an arbitrary number of threads. You could of course alter it for more threads but then it devolves into the standard data-store-with-an-associated-external-lock scenario. |
|||
26 Apr 2013, 15:37 |
|
bitRAKE 26 Apr 2013, 21:16
It does work generally.
1) threads only signal one other thread 2) last write is signal update This insures the sequential operation. Although the write order between processors is not guaranteed, the write order of a single processor is ordered. (Speaking about the main memory bus - multi-chip/core.) So, this describes a sequential fragmented stream of instructions - not a parallel process. It is a lock in that sense, but not a typical lock. It would be advantageous to have several such sequential streams running in parallel. Not sending the bus lock is important. |
|||
26 Apr 2013, 21:16 |
|
revolution 28 Apr 2013, 08:42
bitRAKE wrote: It does work generally. Code: ;... TEST_COUNT1 = 1000 TEST_COUNT2 = 9000 ;... invoke CreateThread,NULL,0,[address#i],TEST_COUNT#i,NULL,thread#i ;... Also I would be keen to see how you would make the code use arbitrary thread counts. |
|||
28 Apr 2013, 08:42 |
|
hopcode 29 Apr 2013, 14:35
bitRAKE's solution is in all cases faster, granted locking is on all involved memory addresses.
here using 8 threads, different counts, and adder value on my Quad Yorkfield Quote: revolution bitRAKE solution should be the choice for big numbers in mem. that 35 cycles branch overhead from JNC is worth more than 2 consecutive locks. having prefetch then on larger memory, it can improve 10%~15% on the best result here. Cheers, _________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
29 Apr 2013, 14:35 |
|
revolution 29 Apr 2013, 14:41
hopcode: Can you show your code please. I couldn't get bitRAKE's code to work with more then two threads or with independent counts on each thread.
|
|||
29 Apr 2013, 14:41 |
|
hopcode 29 Apr 2013, 14:52
Code: ;--- DATA NUMT equ 8 rept NUMT i { thread#i dd ? } sig_start dd 0 ;--- CODE rept NUMT i { invoke CreateThread,NULL,0,[address],TEST_COUNT+i*10000,NULL,thread#i mov [thread#i],eax } mov [sig_start],1 invoke WaitForMultipleObjects,NUMT,thread1,TRUE,-1 Code: proc safe_thread count .start: lock and [sig_start],1 ;--- avoid ResumeThread overhead jz .start mov esi,1000000000 ;--- possible adder value .loop: lock add dword[counter+0],esi jnc @f lock adc dword[counter+4],0 ;--- lock this only when needed ;--- @add64 [counter] ;--- revolution renamed macro @@: dec [count] jnz .loop ret endp should work fine _________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
29 Apr 2013, 14:52 |
|
revolution 29 Apr 2013, 15:02
hopcode: Your "safe_thread" is not safe at all. You can't use 32-bit code like that without some form of lock to prevent other threads reading the half updated values.
|
|||
29 Apr 2013, 15:02 |
|
hopcode 29 Apr 2013, 15:05
agree, because there is no signaling during carry propagation.
_________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
29 Apr 2013, 15:05 |
|
revolution 29 Apr 2013, 15:07
hopcode: Can you show your solution with bitRAKE's code for more than 2 threads and independent counters please? I am curious to see how you extended to get it to work.
|
|||
29 Apr 2013, 15:07 |
|
hopcode 29 Apr 2013, 15:23
i was working on your solution, because bitmaps "delays" the problem to another stage, and it may be too much contextual, where LOCK is already atomic on memory addresses.
anyway i doubt there is a better solution than the 2 locks+JNC as from here, http://board.flatassembler.net/topic.php?p=155105#155105 it is not that bad i think, considering your macro does lock-read + modify + lock-write edit:... a moment, perhaps i am missing something _________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
29 Apr 2013, 15:23 |
|
revolution 29 Apr 2013, 15:44
Oh, you were working from the older code. I thought you were using bitRAKE's later code.
|
|||
29 Apr 2013, 15:44 |
|
hopcode 29 Apr 2013, 16:14
no, because the BTS doesnt assert LOCK intrinsically, and for this reason the method should be avoided even when working on little bitmaps, just because LOCK is atomical and may invalidate the cacheline, while the opposite is false.
in fact trying prefetch to avoid locking on datas in the same lines that do not require update doesnt work. the correct is the second code from bitRAKE, but using 2 locks. now confirmed, using 8 different counter too: same timing as above (relatively) _________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
29 Apr 2013, 16:14 |
|
bitRAKE 29 Apr 2013, 18:49
Sorry, I haven't gotten back to this. I'm working on a general solution. First it will build a model of computation and then apply a concurrency model to execute it with arbitrary threads. So, we can better test specific types of processing (memory intensive, processor intensive, thread dependencies etc.); and specific models of concurrency.
We should make a distinction between the software lock, the lock instruction and the hardware lock# signal. The software lock allows the memory controller to do it's job uninterrupted, and that is why it's faster (even when there is memory contention between cores). _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
29 Apr 2013, 18:49 |
|
hopcode 30 Apr 2013, 05:36
agree with bitRAKE, needed general guidelines summarizing the matter. from several manuals.
example CHAPTER 8 MULTIPLE-PROCESSOR MANAGEMENT from Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes:1, 2A, 2B, 2C, 3A, 3B, and 3C ( http://download.intel.com/products/processor/manual/325462.pdf ) explaining something but that is not all. it's not simple to find an explanation why it works/does-not-work without testing it, and this should be the goal, imo we may find XCHG good because asserting LOCK on older processor, or 2 consecutive locked instructions working fine on newer ones because of cache mechanisms, and considering the privilege level to low to modify in protected mode the way to access memory (WP/WC etc) ehm, that job is hard...but worth Cheers, _________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
30 Apr 2013, 05:36 |
|
revolution 30 Apr 2013, 06:57
I've updated the 64-bit read macro to not use the bus lock to read the value. Instead I use the trick of reading one of the memory contents twice and comparing to see if it changed. Note that this only works correctly when the 64-bit writes are guaranteed atomic so the 'lock cmpxchg8b' instruction still needs to be used for writes to make this work.
Plus I've updated the test to allow arbitrary thread counts by setting an assembly time constant, and combined it all into a single file for testing. Note that the function named 'unsafe_thread_2' can't be broken with this test code but it is still a broken implementation of updating a 64-bit value atomically. A different version of the test code could expose the flaw when both the low and high parts of the 64-bit value are required before the result is written back. A simple adder, as used here, will in fact work correctly and the code is faster than both versions of the safe_thread functions. For some circumstances, if the speed difference is important, then then 'unsafe_thread_2' could be suitable for use, but the use restrictions must be properly considered. Code: TEST_COUNT = 10000000 ADDER_VALUE = 10000000000 THREAD_COUNT = 2 include 'win32ax.inc' macro old_atomic_64_bit_read address { ;thread safe read 64-bit memory location. read edx:eax from memory ;ebx and ecx are corrupted mov ebx,eax ;in case of match place back the same value (non-destructive read) mov ecx,edx ;in case of match place back the same value (non-destructive read) lock cmpxchg8b [address] ;value is returned in edx:eax (edx is high order 32 bits, eax is low order 32 bits) } macro old_atomic_64_bit_write address { ;thread safe write 64-bit memory location. write ecx:ebx to memory (ecx is high order 32 bits, ebx is low order 32 bits) ;edx and eax are corrupted local .retry .retry: lock cmpxchg8b [address] ;write ecx:ebx to memory if edx:eax matches current value jnz .retry ;if no match then try again with new value in edx:eax } macro old_atomic_64_bit_op address,op1,op2 { ;thread safe operation of edi:esi with the 64-bit memory location (edi is high order 32 bits, esi is low order 32 bits) ;eax, ebx, ecx and edx are corrupted ;firstly read the value in the memory location local .retry old_atomic_64_bit_read address .retry: ;transfer to working registers ecx:ebx mov ebx,eax mov ecx,edx ;do the requested operation op1 ebx,esi op2 ecx,edi ;try to write it back lock cmpxchg8b [address] ;write ecx:ebx to memory only if edx:eax matches current value (i.e. no one else has changed the value) jnz .retry ;if the memory value has changed then try again with new value in edx:eax } macro old_atomic_64_bit_add address { old_atomic_64_bit_op address,add,adc } macro old_atomic_64_bit_sub address { old_atomic_64_bit_op address,sub,sbb } macro old_atomic_64_bit_xor address { old_atomic_64_bit_op address,xor,xor } macro old_atomic_64_bit_or address { old_atomic_64_bit_op address, or, or } macro old_atomic_64_bit_and address { old_atomic_64_bit_op address,and,and } macro old_atomic_64_bit_inc address { xor edi,edi mov esi,1 old_atomic_64_bit_add address } macro old_atomic_64_bit_not address { or edi,-1 or esi,-1 old_atomic_64_bit_xor address } ;newer macros below macro atomic_64_bit_read address { ;thread safe read 64-bit memory location. read edx:eax from memory ;ecx is corrupted local .again .again: mov ecx,dword[address+0] mov edx,dword[address+4] mov eax,dword[address+0] cmp ecx,eax jnz .again } macro atomic_64_bit_write address { ;thread safe write 64-bit memory location. write ecx:ebx to memory (ecx is high order 32 bits, ebx is low order 32 bits) ;edx and eax are corrupted local .retry .retry: lock cmpxchg8b [address] ;write ecx:ebx to memory if edx:eax matches current value jnz .retry ;if no match then try again with new value in edx:eax } macro atomic_64_bit_op address,op1,op2 { ;thread safe operation of edi:esi with the 64-bit memory location (edi is high order 32 bits, esi is low order 32 bits) ;eax, ebx, ecx and edx are corrupted ;firstly read the value in the memory location local .retry atomic_64_bit_read address .retry: ;transfer to working registers ecx:ebx mov ebx,eax mov ecx,edx ;do the requested operation op1 ebx,esi op2 ecx,edi ;try to write it back lock cmpxchg8b [address] ;write ecx:ebx to memory only if edx:eax matches current value (i.e. no one else has changed the value) jnz .retry ;if the memory value has changed then try again with new value in edx:eax } macro atomic_64_bit_add address { atomic_64_bit_op address,add,adc } macro atomic_64_bit_sub address { atomic_64_bit_op address,sub,sbb } macro atomic_64_bit_xor address { atomic_64_bit_op address,xor,xor } macro atomic_64_bit_or address { atomic_64_bit_op address, or, or } macro atomic_64_bit_and address { atomic_64_bit_op address,and,and } macro atomic_64_bit_inc address { xor edi,edi mov esi,1 atomic_64_bit_add address } macro atomic_64_bit_not address { or edi,-1 or esi,-1 atomic_64_bit_xor address } .data counter dq ? threads rd THREAD_COUNT message rb 1 shl 12 .code start: stdcall run_test,unsafe_thread,'unsafe' stdcall run_test,unsafe_thread_2,'unsafe2' stdcall run_test,old_safe_thread,'old safe' stdcall run_test,safe_thread,'safe' invoke MessageBox,0,message,'Safe or not?',0 invoke ExitProcess,0 proc run_test uses esi edi ebx,address,name xor ecx,ecx xor ebx,ebx atomic_64_bit_write counter invoke GetTickCount mov edi,eax mov esi,threads .thread_loop: invoke CreateThread,NULL,0,[address],TEST_COUNT,NULL,esi mov [esi],eax add esi,4 cmp esi,threads+THREAD_COUNT*4 jnz .thread_loop invoke WaitForMultipleObjects,THREAD_COUNT,threads,TRUE,-1 invoke GetTickCount sub edi,eax neg edi invoke lstrlen,message invoke wsprintf,addr message+eax,<'%s',9,'Count:%I64u Time:%u',13,10>,[name],double [counter],edi ret endp proc unsafe_thread count mov esi,ADDER_VALUE and 0xffffffff mov edi,ADDER_VALUE shr 32 .loop: mov eax,dword[counter+0] mov edx,dword[counter+4] add eax,esi adc edx,edi mov dword[counter+0],eax mov dword[counter+4],edx dec [count] jnz .loop ret endp proc unsafe_thread_2 count mov esi,ADDER_VALUE and 0xffffffff mov edi,ADDER_VALUE shr 32 .loop: lock add dword[counter+0],esi lock adc dword[counter+4],edi dec [count] jnz .loop ret endp proc old_safe_thread count mov esi,ADDER_VALUE and 0xffffffff mov edi,ADDER_VALUE shr 32 .loop: old_atomic_64_bit_add counter dec [count] jnz .loop ret endp proc safe_thread count mov esi,ADDER_VALUE and 0xffffffff mov edi,ADDER_VALUE shr 32 .loop: atomic_64_bit_add counter dec [count] jnz .loop ret endp .end start Code: unsafe Count:132691383273074688 Time:374 unsafe2 Count:200000000000000000 Time:811 oldsafe Count:200000000000000000 Time:1857 safe Count:200000000000000000 Time:1606 |
|||
30 Apr 2013, 06:57 |
|
hopcode 30 Apr 2013, 08:47
Code: .again: mov ecx,dword[address+0] mov edx,dword[address+4] mov eax,dword[address+0] cmp ecx,eax jnz .again this may have flaws too, for the same reason i agreed with you above about carry propagation on hi address. and you have here the latency of 2 more not-locking-MOV (danger!! ) and a JNZ. the method using 2-LOCK and JNC has the disadvantage to assume the hi register is correct. this has lead me to think that we are only trying good programming design, on the other side neglecting preventing an eventual bad modification/access to a resource (the 64bit counter ). the advantages are but worth to be considered. in my further tests, i run 2 threads safe+unsafe at the same time, and none of the solutions above obviously works on the counter. you may say it is a matter of further design in the safe proc, ok. but the unsafe thread may be an external user's code allowed to access the counter. and this is a frequent real-case. also i wonder whether there is an abstract method to share a resource and check for its safety during access/modification from user-end-thread with very few code. i'll try later today the updates, Cheers, _________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
30 Apr 2013, 08:47 |
|
revolution 30 Apr 2013, 09:26
hopcode wrote:
|
|||
30 Apr 2013, 09:26 |
|
revolution 30 Apr 2013, 09:58
Perhaps some explanation is on order. I'll try my best to show what I mean:
The reader code is always subject to reading an older (out of date) value for the counter. This is true whether or not we use a hardware atomic 64-bit read instruction or if we use split 32-bit reads. We deal with the case of an old value later when the counter is written by use of the lock cmpxchg8b instruction. When reading it is not necessary to check the high order value for changes because we only read it once and it will either be a current value or an older value. Either way the low order value will show any change that is not a multiple of 2^32 and the high order value will be either the current value or an older value if the change is a multiple of 2^32. Once again, if we have read an older value then we deal with that when the counter is written again by use of the lock cmpxchg8b instruction. |
|||
30 Apr 2013, 09:58 |
|
hopcode 30 Apr 2013, 13:50
Quote: unsafe Count:146134917961604096 Time:109 the trick is good and works faster. but i think your older proc, 2 x cmpxchg8, is the safest one. the penalty/advantage using cmpxchg8 is because it is atomical on all 64bit adresses. perhaps internally it executes more than 2-locks (on read and write, as from the manual); for this reason i find it not so performant as it should be. anyway having a huge vector to update, PI calculation for example, the JNC beetweeen 2 lock-adds may improve speed logarithmic. Cheers, _________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
30 Apr 2013, 13:50 |
|
Goto page Previous 1, 2, 3 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.