flat assembler
Message board for the users of flat assembler.

Index > Linux > Nbody shootout bench [SSE]

Author
Thread Post new topic Reply to topic
Melissa



Joined: 12 Apr 2012
Posts: 71
Melissa
I have implemented this : http://shootout.alioth.debian.org/u64q/program.php?test=nbody&lang=java&id=2
java program in fasm and stumbled upon execution speed,
as my program executes at 23 seconds on q6600 @ 2.4GHz,
while java program executes at 22 seconds. Also Intel Fortran
executes at 15 seconds Wink
So I have tried to optimize macro advance a bit
and this is what I've got so far.

Thank for any help in optimizing. Seems that I am
bad at assembler Wink

edit: I have followed fortran program and rearranged
loops so that magnitude calculation can be vectorized
and now program executes at 16 seconds which is
still two seconds behind fortran Wink

edit2: did some minor arrangement's of code so now it executes
at Intel Fortran's speed.


Description:
Download
Filename: nbody2.asm
Filesize: 8.04 KB
Downloaded: 241 Time(s)

Description:
Download
Filename: nbody1.asm
Filesize: 7.31 KB
Downloaded: 220 Time(s)



Last edited by Melissa on 13 Apr 2012, 17:50; edited 3 times in total
Post 12 Apr 2012, 09:02
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Do you have the assembly listing of the Fortran code? It would be interesting to see why it kicks everybody's ass (and also to start optimizing from there Wink)
Post 12 Apr 2012, 20:24
View user's profile Send private message Reply with quote
bubach



Joined: 17 Sep 2004
Posts: 340
Location: Trollhättan, Sweden
bubach
Ouch, beaten by a JAVA program? I'd consider that a serious slap in the face.. Laughing
Yet your asm source seem pretty advanced so I'm really confused as how it could be such a difference...
Post 12 Apr 2012, 20:54
View user's profile Send private message Reply with quote
Melissa



Joined: 12 Apr 2012
Posts: 71
Melissa
LocoDelAssembly
Unfortunately I don't have Intel Fortran, only gfortran which is real
slow (more than twice) . Interestingly with gfortran normal program
http://shootout.alioth.debian.org/u64q/program.php?test=nbody&lang=ifc&id=1
is faster twice than fortran program on first place;)
Seems that library functions with Intel Fortran are much faster
than gfortran's as first place fortran program does things
differently. (seems even slower when implemented in assembly)

bubach
I have further optimized a bit, in macro advance reorganized loops
so that nested loop exits at bottom rather than jumps backwards than exit,
which gave me another second, so now I'm faster than java for half of second Wink

I think that this program can reach fortran's speed, if somehow divsd instruction
in macro advance can be optimized. As it takes considerably large execution
time. bah just single div Wink
Post 12 Apr 2012, 21:16
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
Isn't Intel using some binary tricks instead of divsd? I know some performance primitives (IPP) allow to choose from different precision and speed. Unless there's a fault in the shootout checking system, asm should be equal or faster than any other program.

For example if reciprocal div will get you enough precision (or reciprocal + Newton) and you get the same answer (-0.169075164,-0.169059907) then you might try that. DIV and SQRT always remain the slowest, but usually you can get rid of the SQRT by squaring both sides of the equation. I haven't looked at the code thoroughly so maybe my suggestion is wrong, but its worth a try.
Post 13 Apr 2012, 06:48
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
The sqrt is used to multiply it with the square to calculate a cube, so anther way to get it would be to just pow the square to 3/2 using that impossible to understand pow_sse instruction sequence that we talked about some time ago in this board, but would it be faster?

I don't know any DIV tricks, but there is really nothing that can be done when the dividend is a constant?
Post 13 Apr 2012, 15:37
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Post 13 Apr 2012, 18:08
View user's profile Send private message Reply with quote
Melissa



Joined: 12 Apr 2012
Posts: 71
Melissa
Yes, I have succeeded. Used movapd instead of movupd
and some minor arrangements of code and now executes
at 14 secs Wink

It would be interesting to speed it up further Wink
Post 13 Apr 2012, 18:33
View user's profile Send private message Reply with quote
cod3b453



Joined: 25 Aug 2004
Posts: 619
cod3b453
Just a few ideas, not tried any and they're not guaranteed to improve things:

- Replace divides with reciprocal multiplies
Code:
mulsd xmm0,[R_SOLAR_MASS] ;divsd xmm0,[SOLAR_MASS]
R_SOLAR_MASS dq 0.25 ;SOLAR_MASS dq 4.0    


- Interleave/reorder dependencies
Code:
        movsd xmm1, [.iBody.x]  
        subsd xmm1, [.jBody.x]

        movsd xmm2, [.iBody.y]  
        subsd xmm2, [.jBody.y]

        movsd xmm3, [.iBody.z]  
        subsd xmm3, [.jBody.z]    
to
Code:
        movsd xmm1, [.iBody.x]  
        movsd xmm2, [.iBody.y] 
        movsd xmm3, [.iBody.z] 
        
        subsd xmm1, [.jBody.x]
        subsd xmm2, [.jBody.y]
        subsd xmm3, [.jBody.z]    

- More parallelisation
Code:
        movsd xmm1, [.iBody.x]  
        subsd xmm1, [.jBody.x]

        movsd xmm2, [.iBody.y]  
        subsd xmm2, [.jBody.y]

        movsd xmm3, [.iBody.z]  
        subsd xmm3, [.jBody.z]    
to
Code:
        movdqa xmm1,[.iBody.x] ; y x
        movdqa xmm2,[.iBody.z] ; ? z
        subsd xmm1,[.jBody.x] ; dy dx
        subsd xmm2,[.jBody.z] ; ? dz    

- Precompute constant calculations at assembly time (i.e. init_body)

- Precompute the inner loop and store in memory at run time before running outer loop; slower for smaller n but hopefully faster for larger n

Hpe that helps Cool
Post 27 Apr 2012, 17:40
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
I'd also add, align your Loop Labels by 8 or 16. Use the correct 0x66 0x90 NOP sequence, http://board.flatassembler.net/topic.php?t=4445

Code:
macro NOPPad8
{
    virtual
        align 16
        a = $-$$
    end virtual
    if a=1
       db 90h
    end if
    if a=2
       db 66h,90h
    end if
    if a=3
       db 66h,66h,90h
    end if
    if a=4
       db 66h,66h,66h,90h
    end if
    if a=5
       db 66h,66h,90h,66h,90h
    end if
    if a=6
       db 66h,66h,90h,66h,66h,90h
    end if
    if a=7
       db 66h,66h,66h,90h,66h,66h,90h
    end if 
}
...
NOPPad8
LoopLabel:
...
JNZ LoopLabel
    
Post 01 May 2012, 20:03
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Hey, could you try the following and see how it affects the performance?
As I recall, nbody simulations require the computation of
Code:
a^(-1/2)   and  a^(-3/2)    

I recommend converting these to single precision, using the approximate RSQRPS is instruction to get a good guess, and then iterating of Newton's method at most twice to get about double precision, If you unroll the loops you can also take advantage of the fact that mulpd, addpd, ect. are pipelined, while I believe that sqrtpd and divpd are not.

Code:
suppose x is close to a^(-1/2) (init. by CVT2PD(RSQRPS(CVT2PS(a))))
then 1/2 * x * (3 - a * x*x) is closer to a^(-1/2)    


You will need at most 2 iterations because RSQRPS is accurate to 12 bits, and , of course, the constant multiple of 1/2 can be pulled out of the sum
Post 05 Jun 2012, 08:30
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Did some testing - the program compares the times between the straightforward approach and the iterative one. Method 2 should only differ from Method 1 in the last 5 or 6 bits (53-48=5) and is about 3 times faster. (Sorry its for windows, but that is all i have)
Code:
format PE64 GUI 5.0
entry Start

include 'win64a.inc'




section '.text' code readable executable

Start:                 push  rbp


; Method 1
                     invoke  QueryPerformanceCounter,Time

                        mov  ecx,10000000
align 16
         .loop:      movapd  xmm0,dqword[x]
                     movapd  xmm1,dqword[y]
                     sqrtpd  xmm2,xmm1
                      divpd  xmm0,xmm2
                     movapd  dqword[z1],xmm0   ; z = x/sqrt(y)
                        sub  ecx,1
                        jnz  .loop
                      mulpd  xmm0,dqword[const2]
                     movapd  dqword[z1],xmm0   ; z = 2*x/sqrt(y)


                        mov  rdi,[Time]
                     invoke  QueryPerformanceCounter,Time
                        mov  rax,[Time]
                        sub  rax,rdi
                        mov  [Time1],rax


; Method 2
                     movaps  xmm7,dqword[const3]
                     movaps  xmm6,dqword[const1d2]

                        mov  ecx,10000000
align 16
         .loop2:     movapd  xmm0,dqword[x]
                     movapd  xmm1,dqword[y]

                   cvtpd2ps  xmm2,xmm1
                    rsqrtps  xmm2,xmm2
                   cvtps2pd  xmm2,xmm2

                     movaps  xmm3,xmm2
                      mulpd  xmm3,xmm2
                      mulpd  xmm2,xmm6
                      mulpd  xmm3,xmm1
                      subpd  xmm3,xmm7
                      mulpd  xmm3,xmm2

                     movaps  xmm2,xmm3
                      mulpd  xmm2,xmm3
                      mulpd  xmm2,xmm1
                      subpd  xmm2,xmm7
                      mulpd  xmm2,xmm3

                      mulpd  xmm0,xmm2
                     movaps  dqword[z2],xmm0   ; z = 2*x/sqrt(y)
                        sub  ecx,1
                        jnz  .loop2

                        mov  rdi,[Time]
                     invoke  QueryPerformanceCounter,Time
                        mov  rax,[Time]
                        sub  rax,rdi
                        mov  [Time2],rax



                     invoke  sprintf,Message,MessageFormat,qword[z1+8*0],qword[z1+8*1],dword[Time1],qword[z2+8*0],qword[z2+8*1],dword[Time2]
                     invoke  MessageBoxA,0,Message,Caption,MB_OK


                     invoke  ExitProcess,0
                        ret




section '.data' data readable writeable

align 16
  x dq 3.30000,2.500000
  y dq 0.12345,0.412345
  z1 dq 0,0
  z2 dq 0,0
const3 dq 3.0,3.0
const1d2 dq 0.5,0.5
const2 dq 2.0,2.0

align 8
  Time dq 0
  Time1 dq 0
  Time2 dq 0

align 1
  Caption db 'asdf',0
  MessageFormat db 'loop1 result: %f,%f in %i ticks',0x0d,0x0a,'loop2 result: %f,%f in %i ticks',0
  Message rb 200



section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL',\
          gdi32,'GDI32.DLL',\
          msvcrt,'MSVCRT.DLL'

include 'api\kernel32.inc'
include 'api\user32.inc'
include 'api\gdi32.inc'


 import  msvcrt,\
            sprintf,'sprintf'    
Post 05 Jun 2012, 10:46
View user's profile Send private message Reply with quote
Melissa



Joined: 12 Apr 2012
Posts: 71
Melissa
Thanks, method2 should produce exact same result as this goes 50 million times
in a loop.

thanks again Wink
Post 12 Jun 2012, 20:33
View user's profile Send private message Reply with quote
Melissa



Joined: 12 Apr 2012
Posts: 71
Melissa
I have finally made reciprocal sqrt implementation and yes,
it is fast Wink
Done sse2 version which executes in 8.5 seconds! on q6600
vs 14 seconds for fastest fortran on site Wink
Also done avx version which is much faster than sse version Wink


Description:
Download
Filename: nbodyavx.asm
Filesize: 8.24 KB
Downloaded: 226 Time(s)

Description:
Download
Filename: nbody2.asm
Filesize: 8.51 KB
Downloaded: 233 Time(s)

Post 15 Oct 2012, 09:45
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Good to hear - I actually made an opengl real time n-body simulator with collisions and stuff like that, and I remember that with single threaded avx (I don't remember if I used newton for the inverse square root), I could handle all of the interactions between about 4000 balls before it started to get a little laggy.
I would recommend putting a GUI on your simulator - It is really quite fun to watch.
Post 16 Oct 2012, 05:09
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.

Website powered by rwasa.