flat assembler
Message board for the users of flat assembler.
Index
> Linux > Nbody shootout bench [SSE] 
Author 

Melissa
I have implemented this : http://shootout.alioth.debian.org/u64q/program.php?test=nbody&lang=java&id=2
java program in fasm and stumbled upon execution speed, as my program executes at 23 seconds on q6600 @ 2.4GHz, while java program executes at 22 seconds. Also Intel Fortran executes at 15 seconds So I have tried to optimize macro advance a bit and this is what I've got so far. Thank for any help in optimizing. Seems that I am bad at assembler edit: I have followed fortran program and rearranged loops so that magnitude calculation can be vectorized and now program executes at 16 seconds which is still two seconds behind fortran edit2: did some minor arrangement's of code so now it executes at Intel Fortran's speed.
Last edited by Melissa on 13 Apr 2012, 17:50; edited 3 times in total 

12 Apr 2012, 09:02 

LocoDelAssembly
Do you have the assembly listing of the Fortran code? It would be interesting to see why it kicks everybody's ass (and also to start optimizing from there )


12 Apr 2012, 20:24 

Melissa
LocoDelAssembly
Unfortunately I don't have Intel Fortran, only gfortran which is real slow (more than twice) . Interestingly with gfortran normal program http://shootout.alioth.debian.org/u64q/program.php?test=nbody&lang=ifc&id=1 is faster twice than fortran program on first place;) Seems that library functions with Intel Fortran are much faster than gfortran's as first place fortran program does things differently. (seems even slower when implemented in assembly) bubach I have further optimized a bit, in macro advance reorganized loops so that nested loop exits at bottom rather than jumps backwards than exit, which gave me another second, so now I'm faster than java for half of second I think that this program can reach fortran's speed, if somehow divsd instruction in macro advance can be optimized. As it takes considerably large execution time. bah just single div 

12 Apr 2012, 21:16 

Madis731
Isn't Intel using some binary tricks instead of divsd? I know some performance primitives (IPP) allow to choose from different precision and speed. Unless there's a fault in the shootout checking system, asm should be equal or faster than any other program.
For example if reciprocal div will get you enough precision (or reciprocal + Newton) and you get the same answer (0.169075164,0.169059907) then you might try that. DIV and SQRT always remain the slowest, but usually you can get rid of the SQRT by squaring both sides of the equation. I haven't looked at the code thoroughly so maybe my suggestion is wrong, but its worth a try. 

13 Apr 2012, 06:48 

LocoDelAssembly
The sqrt is used to multiply it with the square to calculate a cube, so anther way to get it would be to just pow the square to 3/2 using that impossible to understand pow_sse instruction sequence that we talked about some time ago in this board, but would it be faster?
I don't know any DIV tricks, but there is really nothing that can be done when the dividend is a constant? 

13 Apr 2012, 15:37 

LocoDelAssembly
It seems the goal was achieved: http://groups.google.com/group/comp.lang.asm.x86/msg/3a1286c201645155


13 Apr 2012, 18:08 

Melissa
Yes, I have succeeded. Used movapd instead of movupd
and some minor arrangements of code and now executes at 14 secs It would be interesting to speed it up further 

13 Apr 2012, 18:33 

cod3b453
Just a few ideas, not tried any and they're not guaranteed to improve things:
 Replace divides with reciprocal multiplies Code: mulsd xmm0,[R_SOLAR_MASS] ;divsd xmm0,[SOLAR_MASS] R_SOLAR_MASS dq 0.25 ;SOLAR_MASS dq 4.0  Interleave/reorder dependencies Code: movsd xmm1, [.iBody.x] subsd xmm1, [.jBody.x] movsd xmm2, [.iBody.y] subsd xmm2, [.jBody.y] movsd xmm3, [.iBody.z] subsd xmm3, [.jBody.z] Code: movsd xmm1, [.iBody.x] movsd xmm2, [.iBody.y] movsd xmm3, [.iBody.z] subsd xmm1, [.jBody.x] subsd xmm2, [.jBody.y] subsd xmm3, [.jBody.z]  More parallelisation Code: movsd xmm1, [.iBody.x] subsd xmm1, [.jBody.x] movsd xmm2, [.iBody.y] subsd xmm2, [.jBody.y] movsd xmm3, [.iBody.z] subsd xmm3, [.jBody.z] Code: movdqa xmm1,[.iBody.x] ; y x movdqa xmm2,[.iBody.z] ; ? z subsd xmm1,[.jBody.x] ; dy dx subsd xmm2,[.jBody.z] ; ? dz  Precompute constant calculations at assembly time (i.e. init_body)  Precompute the inner loop and store in memory at run time before running outer loop; slower for smaller n but hopefully faster for larger n Hpe that helps 

27 Apr 2012, 17:40 

r22
I'd also add, align your Loop Labels by 8 or 16. Use the correct 0x66 0x90 NOP sequence, http://board.flatassembler.net/topic.php?t=4445
Code: macro NOPPad8 { virtual align 16 a = $$$ end virtual if a=1 db 90h end if if a=2 db 66h,90h end if if a=3 db 66h,66h,90h end if if a=4 db 66h,66h,66h,90h end if if a=5 db 66h,66h,90h,66h,90h end if if a=6 db 66h,66h,90h,66h,66h,90h end if if a=7 db 66h,66h,66h,90h,66h,66h,90h end if } ... NOPPad8 LoopLabel: ... JNZ LoopLabel 

01 May 2012, 20:03 

tthsqe
Hey, could you try the following and see how it affects the performance?
As I recall, nbody simulations require the computation of Code: a^(1/2) and a^(3/2) I recommend converting these to single precision, using the approximate RSQRPS is instruction to get a good guess, and then iterating of Newton's method at most twice to get about double precision, If you unroll the loops you can also take advantage of the fact that mulpd, addpd, ect. are pipelined, while I believe that sqrtpd and divpd are not. Code: suppose x is close to a^(1/2) (init. by CVT2PD(RSQRPS(CVT2PS(a)))) then 1/2 * x * (3  a * x*x) is closer to a^(1/2) You will need at most 2 iterations because RSQRPS is accurate to 12 bits, and , of course, the constant multiple of 1/2 can be pulled out of the sum 

05 Jun 2012, 08:30 

tthsqe
Did some testing  the program compares the times between the straightforward approach and the iterative one. Method 2 should only differ from Method 1 in the last 5 or 6 bits (5348=5) and is about 3 times faster. (Sorry its for windows, but that is all i have)
Code: format PE64 GUI 5.0 entry Start include 'win64a.inc' section '.text' code readable executable Start: push rbp ; Method 1 invoke QueryPerformanceCounter,Time mov ecx,10000000 align 16 .loop: movapd xmm0,dqword[x] movapd xmm1,dqword[y] sqrtpd xmm2,xmm1 divpd xmm0,xmm2 movapd dqword[z1],xmm0 ; z = x/sqrt(y) sub ecx,1 jnz .loop mulpd xmm0,dqword[const2] movapd dqword[z1],xmm0 ; z = 2*x/sqrt(y) mov rdi,[Time] invoke QueryPerformanceCounter,Time mov rax,[Time] sub rax,rdi mov [Time1],rax ; Method 2 movaps xmm7,dqword[const3] movaps xmm6,dqword[const1d2] mov ecx,10000000 align 16 .loop2: movapd xmm0,dqword[x] movapd xmm1,dqword[y] cvtpd2ps xmm2,xmm1 rsqrtps xmm2,xmm2 cvtps2pd xmm2,xmm2 movaps xmm3,xmm2 mulpd xmm3,xmm2 mulpd xmm2,xmm6 mulpd xmm3,xmm1 subpd xmm3,xmm7 mulpd xmm3,xmm2 movaps xmm2,xmm3 mulpd xmm2,xmm3 mulpd xmm2,xmm1 subpd xmm2,xmm7 mulpd xmm2,xmm3 mulpd xmm0,xmm2 movaps dqword[z2],xmm0 ; z = 2*x/sqrt(y) sub ecx,1 jnz .loop2 mov rdi,[Time] invoke QueryPerformanceCounter,Time mov rax,[Time] sub rax,rdi mov [Time2],rax invoke sprintf,Message,MessageFormat,qword[z1+8*0],qword[z1+8*1],dword[Time1],qword[z2+8*0],qword[z2+8*1],dword[Time2] invoke MessageBoxA,0,Message,Caption,MB_OK invoke ExitProcess,0 ret section '.data' data readable writeable align 16 x dq 3.30000,2.500000 y dq 0.12345,0.412345 z1 dq 0,0 z2 dq 0,0 const3 dq 3.0,3.0 const1d2 dq 0.5,0.5 const2 dq 2.0,2.0 align 8 Time dq 0 Time1 dq 0 Time2 dq 0 align 1 Caption db 'asdf',0 MessageFormat db 'loop1 result: %f,%f in %i ticks',0x0d,0x0a,'loop2 result: %f,%f in %i ticks',0 Message rb 200 section '.idata' import data readable writeable library kernel32,'KERNEL32.DLL',\ user32,'USER32.DLL',\ gdi32,'GDI32.DLL',\ msvcrt,'MSVCRT.DLL' include 'api\kernel32.inc' include 'api\user32.inc' include 'api\gdi32.inc' import msvcrt,\ sprintf,'sprintf' 

05 Jun 2012, 10:46 

Melissa
Thanks, method2 should produce exact same result as this goes 50 million times
in a loop. thanks again 

12 Jun 2012, 20:33 

Melissa
I have finally made reciprocal sqrt implementation and yes,
it is fast Done sse2 version which executes in 8.5 seconds! on q6600 vs 14 seconds for fastest fortran on site Also done avx version which is much faster than sse version


15 Oct 2012, 09:45 

tthsqe
Good to hear  I actually made an opengl real time nbody simulator with collisions and stuff like that, and I remember that with single threaded avx (I don't remember if I used newton for the inverse square root), I could handle all of the interactions between about 4000 balls before it started to get a little laggy.
I would recommend putting a GUI on your simulator  It is really quite fun to watch. 

16 Oct 2012, 05:09 

< Last Thread  Next Thread > 
Forum Rules:

Copyright © 19992020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.