flat assembler
Message board for the users of flat assembler.
Tutorials and Examples > Accurate multi-threaded 64-bit counters on a 32-bit machine
bitRAKE 07 Mar 2013, 14:53
Code:
proc unsafe_thread count
        mov     eax,1
  .loop:
        lock add dword [counter+0],eax
        jnc     @F
        add     dword [counter+4],eax
    @@:
        dec     [count]
        jnz     .loop
        ret
endp

Doesn't mean anything though.

Last edited by bitRAKE on 07 Mar 2013, 15:05; edited 1 time in total
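To make the hazard in the code above concrete, here is a rough single-threaded C replay of the bad interleaving (the names `counter_lo`, `counter_hi`, `add_low` are illustrative, not from the post): the locked add on the low dword is atomic by itself, but a reader that runs between it and the separate high-dword add sees a torn value.

```c
#include <stdint.h>

/* The 64-bit counter kept as two separate dwords, as in the assembly. */
static uint32_t counter_lo, counter_hi;

/* Step 1 of the unsafe add: the locked add on the low dword.
   Returns 1 if the add carried out of the low dword. */
static int add_low(uint32_t val)
{
    uint32_t old = counter_lo;
    counter_lo = old + val;
    return counter_lo < old;        /* the carry flag the jnc tests */
}

/* Step 2: the separate, non-atomic carry propagation. */
static void propagate_carry(void)
{
    counter_hi += 1;
}

/* What a concurrent reader of the full 64-bit value would observe. */
static uint64_t read64(void)
{
    return ((uint64_t)counter_hi << 32) | counter_lo;
}
```

Replayed sequentially, a read between the two steps observes 0 instead of 2^32, which is exactly the window a second thread can fall into.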
revolution 07 Mar 2013, 15:02
bitRAKE wrote:
revolution 07 Mar 2013, 15:04
using xadd fails:
Code:
Count:1982888850292736 Time:78
Count:2000000000000000 Time:187

Edit: Test code fragment:

Code:
ADDER_VALUE = 1000000000

proc safe_thread count
        mov     esi,ADDER_VALUE and 0xffffffff
        mov     edi,ADDER_VALUE shr 32
  .loop:
        atomic_64_bit_add [counter]
        dec     [count]
        jnz     .loop
        ret
endp

proc unsafe_thread_bitRAKE count
  .loop:
        mov     eax,ADDER_VALUE
        lock xadd dword [counter+0],eax
        jnc     .skip
        add     dword [counter+4],1
  .skip:
        dec     [count]
        jnz     .loop
        ret
endp

Last edited by revolution on 07 Mar 2013, 15:09; edited 1 time in total
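The body of the `atomic_64_bit_add` macro used above isn't shown in the post. A sketch of the usual `lock cmpxchg8b` approach it presumably wraps, written as a C11 compare-exchange loop (the function name and details are assumptions, not the poster's code):

```c
#include <stdint.h>
#include <stdatomic.h>

/* On a 32-bit x86 target a compiler typically implements this
   compare-exchange with lock cmpxchg8b. */
static _Atomic uint64_t counter = 0;

/* Atomically add 'val' to the 64-bit counter; returns the old value.
   The loop retries whenever another thread changed the counter between
   our read and our write, so the carry from the low dword into the
   high dword can never be lost. */
uint64_t atomic_add64(uint64_t val)
{
    uint64_t old = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &old, old + val))
        ;   /* 'old' is refreshed on failure; just retry */
    return old;
}
```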
bitRAKE 07 Mar 2013, 15:07
No, I only posted the INC version.
(You changed it to make it fail.)
revolution 07 Mar 2013, 15:17
bitRAKE wrote: (You change it to make it fail.)
bitRAKE 07 Mar 2013, 15:20
It's impossible to design against every failure. Each problem should be assessed for the level of tolerance desired in that situation. We have not designed a dependent thread which gets held up, and we lack the processing power to overflow a 32-bit number in sufficient time (with INC).
revolution 07 Mar 2013, 15:28
I'm designing against program failures. The code is functionally correct. We can't ask for more than that from code. If the surrounding CPU system has hardware problems then no amount of code can hope to correct for such things. Intel/AMD guarantee the functional correctness of "lock cmpxchg8b" by design. But they can't vouch for someone's flaky PSU altering the memory read values.
bitRAKE 07 Mar 2013, 15:49
The lock cmpxchg8b method is considerably faster than using xchg:
Code:
_counter dd counter

proc safe_thread count
  .loop:
        xor     ecx,ecx
    @@:
        xchg    ecx,[_counter]
        jecxz   @B
        add     dword [ecx+0],1
        adc     dword [ecx+4],0
        mov     [_counter],ecx
        dec     [count]
        jnz     .loop
        ret
endp

; cmpx: 4302467295 Time:842
; xchg: 4302467295 Time:2090
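The swap trick above, restated as a C11 sketch (names are illustrative): a single pointer slot guards the counter. A thread exchanges NULL into the slot; a non-NULL result means it now owns the counter, and a NULL result means another thread holds it, so it spins — that is the `jecxz @B`. The 64-bit add itself then needs no LOCK prefix.

```c
#include <stdint.h>
#include <stdatomic.h>

static uint64_t storage;                        /* the actual counter */
static _Atomic(uint64_t *) counter_slot = &storage;

void slot_increment(void)
{
    uint64_t *p;
    /* xchg ecx,[_counter] / jecxz @B */
    while ((p = atomic_exchange(&counter_slot, (uint64_t *)0)) == 0)
        ;                       /* NULL back: someone else owns it, spin */
    *p += 1;                    /* the add/adc pair, done unlocked */
    atomic_store(&counter_slot, p);     /* hand the pointer back */
}
```

Only one thread can hold the non-NULL pointer at a time, and `xchg` with memory is implicitly locked on x86, which is what makes the unlocked add/adc safe.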
revolution 08 Mar 2013, 00:18
bitRAKE wrote: The lock cmpxchg8b method is considerably faster than using xchg:
randall 31 Mar 2013, 02:58
Very fast and simple atomic increment:
Code:
        mov     eax, 1
        lock xadd [mem], eax

Adds one to [mem] and returns the previous value stored in [mem].
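In C11 terms, `lock xadd` with an operand of 1 is exactly an atomic fetch-and-add; a minimal sketch (the variable name `mem` is taken from the post, the rest is assumed):

```c
#include <stdint.h>
#include <stdatomic.h>

static _Atomic uint32_t mem = 0;

/* mov eax,1 / lock xadd [mem],eax:
   atomically adds 1 and returns the value [mem] held before the add. */
uint32_t fetch_inc(void)
{
    return atomic_fetch_add(&mem, 1);
}
```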
revolution 31 Mar 2013, 03:03
randall wrote: Very fast and simple atomic increment:
bitRAKE 26 Apr 2013, 12:55
This lockless method is consistently faster on my system. It's not as complex as it looks: the counter is partitioned into 16 pieces, each guarded by a bit lock, and each thread operates on the opposite bit state. This is done so that each thread isn't just counting by itself - that would be too easy - so there is actual forced contention for resources.
Edit: I should also say that it's guaranteed to work because of memory ordering rules on the bus.
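The actual bit-lock scheme isn't shown in the post, so the following is only a sketch of the general partitioned-counter idea under assumed names: each thread increments its own piece, pieces therefore rarely contend, and the real count only exists as the sum over all pieces, taken by a reader.

```c
#include <stdint.h>
#include <stdatomic.h>

#define PIECES 16

static _Atomic uint64_t piece[PIECES];

/* Each thread bumps the piece chosen by its id; with one writer per
   piece there is no cross-thread contention on any cache line. */
void piece_add(unsigned thread_id, uint64_t val)
{
    /* relaxed ordering is enough here: no thread reads another
       thread's piece while counting, only the final sum matters */
    atomic_fetch_add_explicit(&piece[thread_id % PIECES], val,
                              memory_order_relaxed);
}

/* The total is the sum over all pieces. */
uint64_t piece_total(void)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < PIECES; i++)
        sum += atomic_load_explicit(&piece[i], memory_order_relaxed);
    return sum;
}
```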
_________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution 26 Apr 2013, 13:16
bitRAKE wrote: This lockless method is consistently faster on my system. Not as complex as it looks - bit lock on pieces and partitioned the counter into 16 pieces. Each thread operates on the opposite bit state. This is done so that each thread isn't just counting by itself - that would be too easy. So, there is actual forced contention for resources.

Aside: I usually find that unless a new method is provably correct (or at least equal to the previous method it replaces), the small increase in speed we might temporarily get is easily outweighed by the enormous loss of productivity we encounter when the new method produces incorrect results.
bitRAKE 26 Apr 2013, 13:22
Sorry, I made an edit. It is guaranteed to work. Vol3, Ch8 in the manual explains the memory ordering rules. It's lockless in that a LOCK# signal is not sent across the bus, which can have a huge impact on performance. And it's not a small gain - it can be more than twice as fast.
revolution 26 Apr 2013, 13:24
Oh, I just saw this.
bitRAKE wrote: Edit: I should also say that it's guaranteed to work because of memory ordering rules on the bus.
revolution 26 Apr 2013, 13:27
bitRAKE wrote: Sorry, I made an edit. It is guaranteed to work. Vol3, Ch8 in the manual explains the memory ordering rules. It's lockless in that a LOCK# signal is not sent across the bus. Which can have a huge impact in performance. It's not a small gain - it can be over twice faster.

I don't care about the performance gain if it can only be guaranteed to work on a subset of systems.
bitRAKE 26 Apr 2013, 13:45
Unless Intel is lying in the manual then it should work in general (Pentium+). I've been using Dual socket mobos since my overclocked BP6.
_________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution 26 Apr 2013, 14:14
bitRAKE wrote: Unless Intel is lying in the manual then it should work in general (Pentium+). I've been using Dual socket mobos since my overclocked BP6.

Anyhow, I also think you misread the manual with regard to bus ordering. I can't find anywhere it is stated that separated reads and writes are guaranteed not to be interleaved between threads. Also, the split 32-bit reads and writes you use have no guarantee that another thread or CPU won't take the bus between operations. Aligning the data only guarantees that the basic operation is atomic; there is no guarantee that multiple operations of smaller size are implicitly locked between accesses. The fact that you couldn't fault it with your test does not mean that it is working the way you think it is.
bitRAKE 26 Apr 2013, 15:07
Using the notation from the manual with a slight modification:
Using the notation from the manual with a slight modification:

Code:
READ  A.1.1
READ  A.1.0
READ  B.1.0
READ  C.1.0
WRITE B.1.0
WRITE C.1.0
READ  A.1.0
WRITE A.1.0

READ  A.2.0
READ  A.2.1
READ  B.2.1
READ  C.2.1
WRITE B.2.1
WRITE C.2.1
READ  A.2.1
WRITE A.2.1

Last edited by bitRAKE on 26 Apr 2013, 15:21; edited 1 time in total
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.