flat assembler
Message board for the users of flat assembler.
![]() |
Author |
|
tthsqe 14 Jun 2013, 07:03
division
Code: ;a = b / c : macro vdivpd_36 a,b,c,t,t_,_1 { vcvtpd2ps t_,c vrcpps t_,t_ vcvtps2pd t,t_ vfnmadd213pd c,t,_1 vmulpd a,t,b vfmadd231pd c,c,c vfmadd231pd a,a,c } ;a = b / c : macro vdivpd_60 a,b,c,t,t_,_1 { vcvtpd2ps t_,c vrcpps t_,t_ vcvtps2pd t,t_ vfnmadd213pd c,t,_1 vmulpd a,t,b vmovapd t,_1 vfmadd231pd t,c,c vfmadd231pd c,c,c vmulpd t,t,c vfmadd231pd a,a,t } |
|||
![]() |
|
tthsqe 14 Jun 2013, 07:05
reciprocal
Code: ; a = 1 / c : macro vrcppd_36 a,a_,c,_1 { vcvtpd2ps a_,c vrcpps a_,a_ vcvtps2pd a,a_ vfnmadd213pd c,a,_1 vfmadd231pd c,c,c vfmadd231pd a,a,c } ;a = 1 / c : macro vrcppd_60 a,a_,c,t,_1 { vcvtpd2ps a_,c vrcpps a_,a_ vcvtps2pd a,a_ vmovapd t,_1 vfnmadd213pd c,a,_1 vfmadd231pd t,c,c vfmadd231pd c,c,c vmulpd t,t,c vfmadd231pd a,a,t } |
|||
![]() |
|
tthsqe 14 Jun 2013, 07:07
division in Newton gravity calculation
Code: ; a = b / Sqrt[c] : macro vdivsqrtpd_36 a,b,c,t,t_,_1,_1d2,_3d8 { vcvtpd2ps t_,c vrsqrtps t_,t_ vcvtps2pd t,t_ vmulpd a,t,b vmulpd t,t,t vfnmadd213pd c,t,_1 vmovapd t,_3d8 vfmadd231pd t,c,_1d2 vmulpd c,c,a vfmadd231pd a,t,c } ; a = b / Sqrt[c] : macro vdivsqrtpd_48 a,b,c,t,t_,_1,_1d2,_3d8,_5d16 { vcvtpd2ps t,c vrsqrtps t_,t_ vcvtps2pd t,t_ vmulpd a,t,b vmulpd t,t,t vfnmadd213pd c,t,_1 vmovapd t,_5d16 vfmadd231pd t,c,_3d8 vfmadd231pd t,c,_1d2 vmulpd c,c,a vfmadd231pd a,t,c } ; a = b / Sqrt[c] : macro vdivsqrtpd_60 a,b,c,t,t_,_1,_1d2,_3d8,_5d16,_35d128 { vcvtpd2ps t_,c vrsqrtps t_,t_ vcvtps2pd t,t_ vmulpd a,t,b vmulpd t,t,t vfnmadd213pd c,t,_1 vmovapd t,_35d128 vfmadd213pd t,c,_5d16 vfmadd213pd t,c,_3d8 vfmadd213pd t,c,_1d2 vmulpd c,c,a vfmadd231pd a,t,c } |
|||
![]() |
|
revolution 14 Jun 2013, 09:59
Please show examples of how to use these macros.
[edit to add] Also, perhaps you should mention the minimum required CPU to execute these instructions. |
|||
![]() |
|
tthsqe 15 Jun 2013, 05:54
Ok, here is my testing apparatus. Since all of those fmadd's seemed to make you madd, I stuck to plain SSE. However, the speed up is about double with 256bit AVX vectors. The sin and cos is surprisingly accurate.
Code: format PE64 GUI 5.0 entry Start include 'win64a.inc' ; these macros perform various operations a = f(b,c) ; they all need extra temporary registers t,u,v,s... ; and a few rational constants _PdQ ( = vector of P/Q) that can reside in memory ; a, b, c, and t should all be different registers ; ; the macros have names fxn_N where N/12 is the order of the approximation ; N is the THEORETICAL number of correct bits in the result ; for example, divpd_60 uses a quintic algorithm since rccps gives 12 correct bits ; a = b / Sqrt[c] : 10x block throughput: 8 cycles macro divsqrtpd_36 a,b,c,t { cvtpd2ps t,c rsqrtps t,t cvtps2pd a,t movaps t,a mulpd a,b mulpd t,t mulpd t,c movaps c,dqword[const_pd_1] subpd c,t movaps t,c mulpd c,dqword[const_pd_3d8] addpd c,dqword[const_pd_1d2] mulpd t,a mulpd c,t addpd a,c } ; a = b / Sqrt[c] : 10x block throughput: 9 cycles macro divsqrtpd_48 a,b,c,t { cvtpd2ps t,c rsqrtps t,t cvtps2pd a,t movaps t,a mulpd a,b mulpd t,t mulpd t,c movaps c,dqword[const_pd_1] subpd c,t movaps t,c mulpd c,dqword[const_pd_5d16] addpd c,dqword[const_pd_3d8] mulpd c,t addpd c,dqword[const_pd_1d2] mulpd t,a mulpd c,t addpd a,c } ; a = b / Sqrt[c] : 10x block throughput: 10 cycles macro divsqrtpd_60 a,b,c,t,_1,_1d2,_3d8,_5d16,_35d128 { cvtpd2ps t,c rsqrtps t,t cvtps2pd a,t movaps t,a mulpd a,b mulpd t,t mulpd t,c movaps c,dqword[const_pd_1] subpd c,t movaps t,c mulpd c,dqword[const_pd_35d128] addpd c,dqword[const_pd_5d16] mulpd c,t addpd c,dqword[const_pd_3d8] mulpd c,t addpd c,dqword[const_pd_1d2] mulpd t,a mulpd c,t addpd a,c } ;a = 1 / c : 10x block throughput: 5 cycles macro rcppd_36 a,c,t { cvtpd2ps t,c rcpps t,t cvtps2pd t,t movaps a,t mulpd t,c movaps c,dqword[const_pd_1] subpd c,t movaps t,c mulpd c,c addpd c,t mulpd c,a addpd a,c } ;a = 1 / c : 10x block throughput: 6 cycles macro rcppd_60 a,c,t { cvtpd2ps t,c rcpps t,t cvtps2pd t,t movaps a,t mulpd t,c movaps c,dqword[const_pd_1] subpd c,t movaps t,c mulpd t,c addpd c,t addpd t,dqword[const_pd_1] mulpd t,c mulpd t,a addpd a,t } ;a = b / c : 10x block throughput: 6 cycles macro divpd_36 a,b,c,t { cvtpd2ps t,c rcpps t,t cvtps2pd a,t movaps t,a mulpd a,b mulpd t,c movaps c,dqword[const_pd_1] subpd c,t movaps t,c mulpd c,c addpd c,t mulpd c,a addpd a,c } ;a = b / c : 10x block throughput: 7 cycles macro divpd_60 a,b,c,t { cvtpd2ps t,c rcpps t,t cvtps2pd a,t movaps t,a mulpd a,b mulpd t,c movaps c,dqword[const_pd_1] subpd c,t movaps t,c mulpd t,c addpd c,t addpd t,dqword[const_pd_1] mulpd t,c mulpd t,a addpd a,t } ; {a,b} = {Sin[c],Cos[c]} macro sincospd_50 a,b,c,t,s { ; two temporaries mulpd c,dqword[const_pd_c0] cvtpd2dq s,c cvtdq2pd t,s subpd c,t movaps t,dqword[const_pd_1d16] mulpd t,c mulpd t,t pand s,dqword[const_dq_3] cvtdq2pd s,s movaps a,dqword[const_pd_s4] mulpd a,t addpd a,dqword[const_pd_s3] mulpd a,t addpd a,dqword[const_pd_s2] mulpd a,t addpd a,dqword[const_pd_s1] mulpd a,t addpd a,dqword[const_pd_s0] movaps b,dqword[const_pd_c4] mulpd b,t addpd b,dqword[const_pd_c3] mulpd b,t addpd b,dqword[const_pd_c2] mulpd b,t addpd b,dqword[const_pd_c1] mulpd b,t repeat 4 movaps t,a mulpd t,b subpd a,t movapd t,dqword[const_pd_m2] mulpd t,b addpd t,dqword[const_pd_4] mulpd b,t end repeat movaps t,s subpd t,dqword[const_pd_1] addpd s,dqword[const_pd_m2] andpd t,dqword[const_pd_abs] andpd s,dqword[const_pd_abs] subpd t,dqword[const_pd_1] subpd s,dqword[const_pd_1] mulpd a,c subpd b,dqword[const_pd_1] ;a = a*s+b*t ;b a*t-b*s movaps c,a mulpd c,t mulpd t,b mulpd a,s mulpd s,b addpd a,t movaps b,c subpd b,s } macro START_MARKER { mov ebx,111 db 0x64,0x67,0x90 } macro END_MARKER { mov ebx,222 db 0x64,0x67,0x90 } section '.text' code readable executable Start: push rbp LOOPS = 20000000 mov ecx,LOOPS/100 .warmup_loop: movaps xmm2,dqword[c] sincospd_50 xmm0,xmm1,xmm2,xmm3,xmm4 ; now {xmm0,xmm1} = {Sin[xmm2],Cos[xmm2]} ; xmm2,xmm3,xmm4 are garbage sub ecx,1 jnz .warmup_loop invoke QueryPerformanceCounter,Time1 mov ecx,LOOPS align 64 .approx_loop: repeat 4 movaps xmm1,dqword[b] movaps xmm2,dqword[c] divsqrtpd_48 xmm0,xmm1,xmm2,xmm3 movaps dqword[a],xmm0 end repeat sub ecx,1 jnz .approx_loop invoke QueryPerformanceCounter,Time2 mov ecx,LOOPS align 64 .exact_loop: repeat 4 movapd xmm0,dqword[b] sqrtpd xmm2,dqword[c] divpd xmm0,xmm2 movapd dqword[A],xmm0 end repeat sub ecx,1 jnz .exact_loop invoke QueryPerformanceCounter,Time3 lea rdi,[Message] mov rax,'speed up' stosq mov rax,'factor: ' stosq fild dword[thousand] fild qword[Time1] fild qword[Time3] fild qword[Time2] fsub st2,st0 fsubrp st1,st0 fdivrp st1,st0 fmulp st1,st0 fistp dword[rsp-8] mov eax,dword[rsp-8] xor edx,edx div dword[thousand] push rdx call PrintInteger mov al,'.' stosb pop rax xor edx,edx mov ecx,100 div ecx push rdx call PrintInteger pop rax xor edx,edx mov ecx,10 div ecx push rdx call PrintInteger pop rax call PrintInteger mov al,0x0d stosb irps i, 0 1 { fld qword[A+8*i] call PrintBinaryFloat mov al,0xd stosb fld qword[a+8*i] call PrintBinaryFloat mov al,0xd stosb mov eax,'----' stosd mov al,0xd stosb } mov al,0 stosb invoke MessageBoxA,0,Message,Caption,MB_OK invoke ExitProcess,0 PrintBinaryFloat: fstp tword[rsp-16+4] mov ax,word[rsp-16+12] mov byte[rdi],'+' bt eax,15 jnc @f mov byte[rdi],'-' @@: add rdi,1 and ax,0x7FFF jz .Zero cmp ax,0x7FFF je .Infinity mov rdx,qword[rsp-16+4] mov al,'0' shl rdx,1 adc eax,0 stosb mov al,'.' stosb mov ecx,52 @@: mov al,'0' shl rdx,1 adc eax,0 stosb sub ecx,1 jnz @b mov eax,'*2^' stosd sub rdi,1 movzx eax,word[rsp-16+12] and eax,0x07FFF sub rax,16383 call PrintInteger jmp .ret .Zero: mov al,'0' stosb jmp .ret .Infinity: mov rax,[rsp-16+4] test rax,rax jns .Nan mov eax,'Inf ' stosd jmp .ret .Nan: mov eax,'NaN ' stosd fstp st0 ; jmp .ret .ret: ret PrintInteger: ; rax: number push rbp push rdx mov rbp,rsp test rax,rax jns .l1 mov byte[rdi],'-' add rdi,1 neg rax .l1: xor edx,edx div [OutputBase] push rdx test rax,rax jnz .l1 .l2: pop rax cmp al,9 jbe @f add al,'a'-10-'0' @@: add al,'0' stosb cmp rsp,rbp jb .l2 pop rdx pop rbp ret section '.data' data readable writeable align 16 ; results A: dq ?,? a dq ?,? ; operands b: dq 3.89,7.98 c: dq 5.17,6.17 ; constants const_pd_m2 dq -2.0,-2.0 const_pd_4 dq 4.0,4.0 const_pd_1: dq 1.0,1.0 const_pd_1d2 dq 0.5,0.5 const_pd_3d8 dq 0.375,0.375 const_pd_5d16 dq 0.3125,0.3125 const_pd_1d16 dq 0.0625,0.0625 const_pd_35d128 dq 0.2734375, 0.2734375 const_pd_c0 dq 0.63661977236758134308,0.63661977236758134308 const_pd_c1 dq 1.2337005501361698274,1.2337005501361698274 const_pd_c2 dq -0.25366950790104801364,-0.25366950790104801364 const_pd_c3 dq 0.020863480763352960873,0.020863480763352960873 const_pd_c4 dq -0.00091926027483942658024,-0.00091926027483942658024 const_pd_s0 dq 1.5707963267948966192,1.5707963267948966192 const_pd_s1 dq -0.64596409750624625366,-0.64596409750624625366 const_pd_s2 dq 0.079692626246167045121,0.079692626246167045121 const_pd_s3 dq -0.0046817541353186881007,-0.0046817541353186881007 const_pd_s4 dq 0.00016044118478735982187,0.00016044118478735982187 const_pd_abs dq 0x7FFFFFFFFFFFFFFF,0x7FFFFFFFFFFFFFFF const_dq_3 dd 3,3,3,3 align 8 OutputBase dq 10 Time1 dq ? Time2 dq ? Time3 dq ? align 4 thousand dd 1000 align 1 Caption db 'test',0 Message rb 1024 section '.idata' import data readable writeable library kernel32,'KERNEL32.DLL',\ user32,'USER32.DLL',\ gdi32,'GDI32.DLL',\ msvcrt,'MSVCRT.DLL' include 'api\kernel32.inc' include 'api\user32.inc' include 'api\gdi32.inc' import msvcrt,\ sprintf,'sprintf' |
|||
![]() |
|
DOS386 15 Jun 2013, 09:48
Quote: you should mention the minimum required CPU to execute these instructions. IIRC I coded an 80386-compatible (no FPU) square root function and a 8080-and-6502-compatible (no DIV instruction) decimal output routine some time ago. You know, programming is art ![]() PS: buying a new PC every week is not art _________________ Bug Nr.: 12345 Title: Hello World program compiles to 100 KB !!! Status: Closed: NOT a Bug |
|||
![]() |
|
tthsqe 15 Jun 2013, 15:24
The question is not if it can be done (even without an FPU), but rather if we can get close to the desired accuracy much FASTER than the built-in functions.
I also do not have a haswell cpu - I was just illustrating how the haswell new instructions can simplify things tremendously. Whether or not your solution uses fancy instructions, as long as it is elegant I think it is good art. The solutions in my most recent point will run on the early AMD 64 bit architecture. PS: would you mind posting your no-FPU no-DIV square root solution? Does it work on single, double, or extended precision? Or is it for integer square roots? |
|||
![]() |
|
DOS386 16 Jun 2013, 02:27
tthsqe wrote: posting your no-FPU no-DIV square root solution? 32-bit integers (and it has DIV, but the 64-bit decimal output is DIV-free). I also coded a 66-Byte's DIV-free-8080-compatible SINE function: http://www.freebasic.net/forum/viewtopic.php?t=12490 |
|||
![]() |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.