flat assembler
Message board for the users of flat assembler.

Accurate multi-threaded 64-bit counters on a 32-bit machine

hopcode 04 May 2013, 19:29
and one should stay content with the timing of the oldsafe proc, considering Intel's recommended solution from Example 8-4:
http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
on my Yorkfield it times at least 3x slower!!
using PAUSE to signal entering the wait-loop of the new safe proc gives a small improvement, but only an unstable 10%. it may work better on older processors though.
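For readers without the manual at hand, here is a minimal C sketch of a spin-wait lock guarding a 64-bit counter. It assumes Example 8-4 is the spin-wait loop with PAUSE (read the lock until it looks free, then try an atomic exchange); the example numbering may differ between manual revisions. The names `cpu_relax`, `lock_var`, `locked_counter` and `locked_add64` are made up for illustration and do not come from the thread, and `__builtin_ia32_pause` assumes GCC or Clang.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Spin-wait hint: PAUSE on x86, nothing elsewhere (assumes GCC/Clang builtins). */
static inline void cpu_relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __builtin_ia32_pause();
#endif
}

/* Hypothetical names, not from the thread. */
static atomic_int lock_var = 0;      /* 0 = free, 1 = taken */
static uint64_t locked_counter = 0;  /* only touched while holding lock_var */

static void spin_acquire(void)
{
    for (;;) {
        /* Spin on a plain read first so waiters do not keep issuing locked writes. */
        while (atomic_load_explicit(&lock_var, memory_order_relaxed) != 0)
            cpu_relax();
        /* The lock looks free: try to take it with an atomic exchange. */
        if (atomic_exchange_explicit(&lock_var, 1, memory_order_acquire) == 0)
            return;
    }
}

static void spin_release(void)
{
    atomic_store_explicit(&lock_var, 0, memory_order_release);
}

/* Add a 64-bit value to the counter under the lock; returns the new value. */
static uint64_t locked_add64(uint64_t value)
{
    spin_acquire();
    uint64_t result = (locked_counter += value);
    spin_release();
    return result;
}
```

Every update here pays for an acquire and a release around the add, which is one plausible reason a lock-based version could time slower than a single `lock cmpxchg8b` retry loop. Only measurement on the target machine would confirm that.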

hopcode 06 May 2013, 11:25
while checking for updates on azillionmonkeys today, i had an idea while reading the code there.
it may achieve 2 benefits:
1) save some power while spinning
2) smooth the aggressiveness of the thread, allowing other threads to hit successfully
Code:
 mov esi,ADDER_VALUE and 0xffffffff
 mov edi,ADDER_VALUE shr 32
                        
.loop:
  pause 
  times 4 nop
  mov eax, dword [counter]   ; [[1]]
  mov edx, dword [counter+4] ; [[2]]
  mov ebx, eax
  mov ecx, edx
  add ebx, esi
  adc ecx, edi
  lock cmpxchg8b  qword [counter]        ; [[3]]
  jnz .loop

  dec [count]
  jnz .loop
  ret
    
the pause works as a hint for newer processors that the code is "entering a locking loop".
it executes as a "nop" on older processors. the 4 nops there have been calculated from the latency/reciprocal throughput of the following instructions.
their goal is to desynchronize the loop by half of its own latency, thereby "yielding" access to the resource to other threads.
tests on a Yorkfield (quad-core, 45nm) follow:
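The retry loop described above can also be sketched in C11 atomics, which may help when comparing against compiler output. This is only a rough equivalent: `counter` stands in for the thread's variable, `spin_hint` and `add64_cas` are made-up names, `__builtin_ia32_pause` assumes GCC or Clang, and the four padding nops are left out because their count was tuned for one specific CPU.

```c
#include <stdatomic.h>
#include <stdint.h>

/* PAUSE on x86, nothing elsewhere (assumes GCC/Clang builtins). */
static inline void spin_hint(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __builtin_ia32_pause();
#endif
}

/* Hypothetical stand-in for the thread's `counter` variable. */
static _Atomic uint64_t counter = 0;

/* Add `adder` to the counter `count` times.
   Returns the last value this caller stored (or the current value if count is 0). */
static uint64_t add64_cas(uint64_t adder, unsigned count)
{
    uint64_t stored = atomic_load_explicit(&counter, memory_order_relaxed);
    while (count--) {
        uint64_t expected, desired;
        do {
            spin_hint();  /* hint at the top of every attempt, as in the asm loop */
            expected = atomic_load_explicit(&counter, memory_order_relaxed);
            desired = expected + adder;  /* the add/adc pair in the 32-bit asm */
        } while (!atomic_compare_exchange_strong(&counter, &expected, desired));
        stored = desired;
    }
    return stored;
}
```

One difference from the asm: the C load of `counter` is a single atomic 64-bit read, while the asm reads the two halves separately and relies on `cmpxchg8b` to reject a torn value. Either way a stale or torn snapshot only costs one more trip around the loop.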

revolution_atom64.exe
Code:
unsafe  Count:156602052344132608 Time:125
unsafe2 Count:200000000000000000 Time:1451  <---
oldsafe Count:200000000000000000 Time:2074
safe    Count:200000000000000000 Time:1592

unsafe  Count:140552233700888576 Time:110
unsafe2 Count:200000000000000000 Time:1482   <---
oldsafe Count:200000000000000000 Time:1918
safe    Count:200000000000000000 Time:1358

unsafe  Count:140235594251510784 Time:109
unsafe2 Count:200000000000000000 Time:1545   <---
oldsafe Count:200000000000000000 Time:2012
safe    Count:200000000000000000 Time:1295
    
more stable results. then running 8 threads all together, all of them using the proc above, accessing 4 counters:

atom64.exe
Code:
Counter:2000000000001000 Time:1294
Counter:2000000000001000 Time:1294
Counter:2000000000001000 Time:1294
Counter:2000000000001000 Time:1294

Counter:2000000000001000 Time:1279
Counter:2000000000001000 Time:1279
Counter:2000000000001000 Time:1279
Counter:2000000000001000 Time:1279

Counter:2000000000001000 Time:1279
Counter:2000000000001000 Time:1279
Counter:2000000000001000 Time:1279
Counter:2000000000001000 Time:1279    
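A minimal sketch of how an 8-thread, 4-counter check like this could be set up with POSIX threads follows. The post does not say how threads map to counters or which adder and iteration count atom64.exe uses, so two threads per counter, `ITERATIONS` and `ADDER` are all assumptions, and `worker` and `run_counters` are made-up names. Older toolchains may need `-pthread` when compiling.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

enum { NUM_THREADS = 8, NUM_COUNTERS = 4 };  /* thread and counter counts from the post */

/* Hypothetical values: the post does not give them. */
#define ITERATIONS 10000u
#define ADDER 0x100000001ULL  /* changes both 32-bit halves on every add */

static _Atomic uint64_t counters[NUM_COUNTERS];

static void *worker(void *arg)
{
    _Atomic uint64_t *target = arg;
    for (unsigned i = 0; i < ITERATIONS; i++) {
        uint64_t expected = atomic_load_explicit(target, memory_order_relaxed);
        /* On failure `expected` is refreshed with the current value, so just retry. */
        while (!atomic_compare_exchange_weak(target, &expected, expected + ADDER))
            ;
    }
    return NULL;
}

/* Runs all threads, two per counter (an assumption).
   Returns 0 on success, -1 if a thread could not be started. */
static int run_counters(void)
{
    pthread_t threads[NUM_THREADS];
    int started = 0;

    for (int c = 0; c < NUM_COUNTERS; c++)
        atomic_store(&counters[c], 0);

    for (; started < NUM_THREADS; started++) {
        if (pthread_create(&threads[started], NULL, worker,
                           &counters[started % NUM_COUNTERS]) != 0)
            break;
    }
    for (int t = 0; t < started; t++)
        pthread_join(threads[t], NULL);

    return started == NUM_THREADS ? 0 : -1;
}
```

After `run_counters()` returns 0, every counter should hold exactly `2 * ITERATIONS * ADDER`. Any other value means an update was lost, which is the accuracy property this topic is about.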
the 8-thread results are very stable, and fast. to confirm this i ran the same tests on
my older P4 650 (Prescott, 90nm, single core):

revolution_atom64.exe
Code:
unsafe  Count:111203394649338880 Time:406
unsafe2 Count:200000000000000000 Time:1454 <---
oldsafe Count:200000000000000000 Time:2265
safe    Count:200000000000000000 Time:1578

unsafe  Count:109429369441373184 Time:422
unsafe2 Count:200000000000000000 Time:1469  <---
oldsafe Count:200000000000000000 Time:2265
safe    Count:200000000000000000 Time:1578

unsafe  Count:113975042205012992 Time:407
unsafe2 Count:200000000000000000 Time:1578  <---
oldsafe Count:200000000000000000 Time:2250
safe    Count:200000000000000000 Time:1593

unsafe  Count:111315688249406464 Time:422
unsafe2 Count:200000000000000000 Time:1484  <---
oldsafe Count:200000000000000000 Time:2250
safe    Count:200000000000000000 Time:1594
    
and again 8 threads, accessing 4 counters

atom64.exe
Code:
Counter:2000000000001000 Time:593
Counter:2000000000001000 Time:593
Counter:2000000000001000 Time:593
Counter:2000000000001000 Time:593

Counter:2000000000001000 Time:312
Counter:2000000000001000 Time:312
Counter:2000000000001000 Time:312
Counter:2000000000001000 Time:312

Counter:2000000000001000 Time:704
Counter:2000000000001000 Time:704
Counter:2000000000001000 Time:704
Counter:2000000000001000 Time:704

Counter:2000000000001000 Time:453
Counter:2000000000001000 Time:453
Counter:2000000000001000 Time:453
Counter:2000000000001000 Time:453
    
and this is pretty satisfying, imo.
Cheers,
:D

revolution 06 May 2013, 12:11
hopcode: Thanks for the update. But note that your optimisations are probably only sensible for this particular test code. In a real program I doubt that such things would be necessary, and they might even be harmful to performance. Only proper testing would show which. Still, it is good to have alternatives available for people to try.

But perhaps this thread is starting to drift a little from the original purpose of this topic? The only important thing is the accuracy. Having the timings optimised for particular CPU/mobo combinations is not important or interesting unless there is an improvement that is workable on all CPUs and is going to give at least double the performance. If not, then all this extra time messing about is probably wasted in the long-term scheme of things.
hopcode 06 May 2013, 12:59
yeap, agree. from my side i can say those tests were merely a confirmation after some theoretical acquaintance with the CPU specs.
i have sometimes seen professionals doing fast calculations on their fingers without needing to write/test a single line of code,
with results not far from reality. that is just what i would like to learn, because there is a lot of
different new and newer hardware out there. having several manuals and specs doesn't help when all the time is wasted on testing.
:D


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.