flat assembler
Message board for the users of flat assembler.
Index
> Tutorials and Examples > Fast sinus SSE. |
Author |
|
revolution 10 Sep 2015, 05:55
Is it only single precision?
What is the accuracy? What is the input range? |
|||
10 Sep 2015, 05:55 |
|
Roman 10 Sep 2015, 06:17
Input range float 32 bit.
Output float 32 bit value. Accuracy. Seven signs after point. Last edited by Roman on 10 Sep 2015, 06:28; edited 1 time in total |
|||
10 Sep 2015, 06:17 |
|
Roman 10 Sep 2015, 06:22
Windows calculator sin 10 degree = 0,17364817766693034885171662676931
FastSin sin 10 degree = 0,1736481786 |
|||
10 Sep 2015, 06:22 |
|
Roman 10 Sep 2015, 06:25
FastSin can be useful in software render, or sound synthesis.
|
|||
10 Sep 2015, 06:25 |
|
RIxRIpt 10 Sep 2015, 19:50
Roman, maybe you can be interested in this asm listing / snippet:
5 times faster on my cpu Code: format PE CONSOLE include 'win32ax.inc' entry main pow = 20 section '.code' code readable executable proc main mov esi, 1 shl pow .loop: rdtsc mov dword[time], eax mov dword[time + 4], edx repeat 50 movss xmm0, [src] ;unedited movss [esp - 4], xmm0 movss xmm1, [_ps_am_inv_sign_mask] mov eax, [esp - 4] mulss xmm0, [_ps_am_2_o_pi] andps xmm0, xmm1 and eax, 0x80000000 cvttss2si ecx, xmm0 movss xmm1, [_ps_am_1] mov edx, ecx shl edx, (31 - 1) cvtsi2ss xmm2, ecx and ecx, 0x1 and edx, 0x80000000 subss xmm0, xmm2 movss xmm6, [_sincos_masks + ecx * 4] minss xmm0, xmm1 movss xmm5, [_ps_sincos_p3] subss xmm1, xmm0 andps xmm1, xmm6 andnps xmm6, xmm0 orps xmm1, xmm6 movss xmm4, [_ps_sincos_p2] movss xmm0, xmm1 mulss xmm1, xmm1 movss xmm7, [_ps_sincos_p1] xor eax, edx movss xmm2, xmm1 mulss xmm1, xmm5 movss xmm5, [_ps_sincos_p0] mov [esp - 4], eax addss xmm1, xmm4 mulss xmm1, xmm2 movss xmm3, [esp - 4] addss xmm1, xmm7 mulss xmm1, xmm2 orps xmm0, xmm3 addss xmm1, xmm5 mulss xmm0, xmm1 cvtss2sd xmm0, xmm0 movlpd [dest], xmm0 end repeat rdtsc sub eax, dword[time] sbb edx, dword[time + 4] add dword[total_time], eax adc dword[total_time + 4], edx dec esi jnz .loop fild [total_time] fidiv [divv] fstp [total_time] cinvoke printf, fmt1, double[total_time], double[dest] xorps xmm0, xmm0 movlpd [total_time], xmm0 finit mov esi, 1 shl pow .loop2: rdtsc mov dword[time], eax mov dword[time + 4], edx repeat 50 fld [src] fsin fstp [dest] end repeat rdtsc sub eax, dword[time] sbb edx, dword[time + 4] add dword[total_time], eax adc dword[total_time + 4], edx dec esi jnz .loop2 fild [total_time] fidiv [divv] fstp [total_time] cinvoke printf, fmt2, double[total_time], double[dest] .below: cinvoke getch ret endp section '.data' data readable writeable time dq 0 total_time dq 0 divv dd 1 shl pow fmt1 db 'SSE2 %lf sin(45 degree): %lf', 10, 0 fmt2 db 'FPU %lf sin(45 degree): %lf', 10, 0 src dd 0.78539816339744830961566084581988 dest dq ? align 4 _ps_am_inv_sign_mask dd 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF _ps_am_pi_o_2 dd 1.57079632679, 1.57079632679, 1.57079632679, 1.57079632679 _ps_am_2_o_pi dd 0.63661977236, 0.63661977236, 0.63661977236, 0.63661977236 _epi32_1 dd 1, 1, 1, 1 _ps_am_1 dd 1.0, 1.0, 1.0, 1.0 _epi32_2 dd 2, 2, 2, 2 _ps_sincos_p3 dd -0.00468175413, -0.00468175413, -0.00468175413, -0.00468175413 _ps_sincos_p2 dd 0.0796926262, 0.0796926262, 0.0796926262, 0.0796926262 _ps_sincos_p1 dd -0.64596409750621,-0.64596409750621,-0.64596409750621,-0.64596409750621 _ps_sincos_p0 dd 1.570796326794896, 1.570796326794896, 1.570796326794896, 1.570796326794896 _sincos_masks dd 0x0, not 0x0 section '.idata' import data readable writeable library msvcrt, 'msvcrt.dll' import msvcrt,\ printf, 'printf',\ getch, '_getch' src |
|||
10 Sep 2015, 19:50 |
|
revolution 11 Sep 2015, 01:22
RIxRIpt:
Is it only double precision? What is the accuracy? What is the input range? |
|||
11 Sep 2015, 01:22 |
|
Roman 11 Sep 2015, 06:53
RIxRIpt
Tell your model CPU. Your version fast sin can not be faster. Because the principle is similar, and the same around number CPU of commands. Your 41 commands. And using slow SSE commands andps(2 times), andnps, minss. My 33 commands. Using two slow SSE commands pslld and pxor |
|||
11 Sep 2015, 06:53 |
|
Siekmanski 11 Sep 2015, 08:39
Another one, it's in masm notation.....
On my PC ( Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz) this single float routine ( calcs 4 at once ) is 196 times faster than fsin. Code: .data align 16 OneDivPi real4 4 dup (0.31830988618379067153776752674) Pi real4 4 dup (3.14159265358979323846264338327) ChebyChf40 real4 4 dup (-0.16666657096504701458241294000) ChebyChf41 real4 4 dup (0.00833301729156221812798629161) ChebyChf42 real4 4 dup (-0.00019806615201350805044116296) ChebyChf43 real4 4 dup (0.00000260005476789036127712325) .code align 16 SSE2sinSPf proc ; Calculate Sine of 1,2,3 or 4 packed single floating point values ( in: xmm0, result: xmm0 ) mulps xmm0,oword ptr OneDivPi ; 1/pi to get a 1pi range cvtps2dq xmm3,xmm0 ; (4 packed spfp to 4 packed int32) lose the fractional parts and keep it in xmm3 to save the signs cvtdq2ps xmm1,xmm3 ; (4 packed int32 to 4 packed spfp) save the integral parts subps xmm0,xmm1 ; now it's inside the range, results are values between -0.5 to 0.4999999 pslld xmm3,31 ; put sign-bits in position, to place values in the right hemispheres xorps xmm0,xmm3 ; set sign-bits mulps xmm0,oword ptr Pi ; restore ranges between -1/2 pi to +1/2 pi ; Now do the Chebyshev approximation of a 9th degree polynomial ; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi movaps xmm2,xmm0 mulps xmm2,xmm2 movaps xmm1,oword ptr ChebyChf43 mulps xmm1,xmm2 addps xmm1,oword ptr ChebyChf42 mulps xmm1,xmm2 addps xmm1,oword ptr ChebyChf41 mulps xmm1,xmm2 addps xmm1,oword ptr ChebyChf40 mulps xmm2,xmm0 mulps xmm1,xmm2 addps xmm0,xmm1 ret SSE2sinSPf endp Last edited by Siekmanski on 11 Sep 2015, 10:22; edited 1 time in total |
|||
11 Sep 2015, 08:39 |
|
revolution 11 Sep 2015, 08:50
Siekmanski:
Is it only single precision? What is the accuracy? What is the input range? |
|||
11 Sep 2015, 08:50 |
|
Siekmanski 11 Sep 2015, 09:38
Yes, this one calculates the sine of 4 single floating point values at once. (same code can be used to calculate also 1 or 2 or 3 values at once)
9 digits precision, enough for some purposes and it's fast. The code keeps the input value range between -1/2 and +1/2 PI, so the input can be large but the larger the input value the more precision you'll loose. Just try it out. A double precision example, It calculates the sine of 2 double floating point values at once: Code: .data align 16 OneDivPiDP real8 2 dup (0.31830988618379067153776752674) PiDP real8 2 dup (3.14159265358979323846264338327) ChebyChf0 real8 2 dup (-0.16666657096504701458241294000) ChebyChf1 real8 2 dup (0.00833301729156221812798629161) ChebyChf2 real8 2 dup (-0.00019806615201350805044116296) ChebyChf3 real8 2 dup (0.00000260005476789036127712325) .code align 16 SSE2sinDPf proc ; Calculate Sine of 1 double or 2 packed double floating point values ( in: xmm0, result: xmm0 ) mulpd xmm0,oword ptr OneDivPiDP ; 1/pi to get a 1pi range cvtpd2dq xmm3,xmm0 ; (2 packed dpfp to 2 int32) lose the fractional parts and keep it in xmm3 to save the signs cvtdq2pd xmm1,xmm3 ; (2 packed int32 to 2 dpfp) save the integral parts subpd xmm0,xmm1 ; now it's inside the range, results are values between -0.5 to 0.4999999 pslld xmm3,31 ; put sign-bits in position, to place values in the right hemispheres pshufd xmm3,xmm3,Shuffle(1,3,0,2) ; shuffle the sign-bits into place xorpd xmm0,xmm3 ; set sign-bits mulpd xmm0,oword ptr PiDP ; restore ranges between -1/2 pi to +1/2 pi ; Now do the Chebyshev approximation of a 9th degree polynomial ; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi movapd xmm2,xmm0 mulpd xmm2,xmm2 movapd xmm1,oword ptr ChebyChf3 mulpd xmm1,xmm2 addpd xmm1,oword ptr ChebyChf2 mulpd xmm1,xmm2 addpd xmm1,oword ptr ChebyChf1 mulpd xmm1,xmm2 addpd xmm1,oword ptr ChebyChf0 mulpd xmm2,xmm0 mulpd xmm1,xmm2 addpd xmm0,xmm1 ret SSE2sinDPf endp Last edited by Siekmanski on 11 Sep 2015, 12:25; edited 1 time in total |
|||
11 Sep 2015, 09:38 |
|
Roman 11 Sep 2015, 11:50
Siekmanski
Great example !!! You cool !!! But your example not faster than my example. I test on my CPU and not get great result. My example i run 1000000 and got 10 And i run your example 1000000 and got 10~25 !!! Code: invoke timeBeginPeriod,1 invoke timeGetTime mov dword [chVl],eax do cycle 1000000 Call FastSinus invoke timeGetTime mov ebx,dword [chVl] sub eax,ebx In eax = all time |
|||
11 Sep 2015, 11:50 |
|
Roman 11 Sep 2015, 12:13
Siekmanski
O ! Sorry I understend. Your example at once calculating four sinus(4 float32) !!! Nice ! Very nice !!! If i right real4 for Fasm DD. DD 4 dup (0.31830988618379067153776752674) Why need so long number for 32 bit number ? |
|||
11 Sep 2015, 12:13 |
|
Roman 11 Sep 2015, 13:23
Code: movaps xmm0,dqword [FourRadians] mulps xmm0,dqword [OneDivPi] ; 1/pi to get a 1pi range cvtps2dq xmm3,xmm0 ; (4 packed spfp to 4 packed int32) lose the fractional parts and keep it in xmm3 to save the signs cvtdq2ps xmm1,xmm3 ; (4 packed int32 to 4 packed spfp) save the integral parts subps xmm0,xmm1 ; now it's inside the range, results are values between -0.5 to 0.4999999 pslld xmm3,31 ; put sign-bits in position, to place values in the right hemispheres xorps xmm0,xmm3 ; set sign-bits mulps xmm0,dqword [Pi] ; restore ranges between -1/2 pi to +1/2 pi ; Now do the Chebyshev approximation of a 9th degree polynomial ; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi movaps xmm2,xmm0 mulps xmm2,xmm2 movaps xmm1,dqword [ChebyChf43] mulps xmm1,xmm2 addps xmm1,dqword [ChebyChf42] mulps xmm1,xmm2 addps xmm1,dqword [ChebyChf41] mulps xmm1,xmm2 addps xmm1,dqword [ChebyChf40] mulps xmm2,xmm0 mulps xmm1,xmm2 addps xmm0,xmm1 movaps dqword [Sin1234],xmm0 align 16 OneDivPi dd 4 dup (0.3183098861837) Pi dd 4 dup (3.1415926535897) ChebyChf40 dd 4 dup (-0.1666665709650) ChebyChf41 dd 4 dup (0.0083330172915) ChebyChf42 dd 4 dup (-0.0001980661520) ChebyChf43 dd 4 dup (0.0000026000547) FourRadians dd 0.174532925,1.570796325,0.0174532925,1.570796325 ;four numbers radian Sin1234 dd 0,0,0,0 ;result four sinus |
|||
11 Sep 2015, 13:23 |
|
Siekmanski 11 Sep 2015, 13:35
ChebyChf40 real4 4 dup (-0.16666657096504701458241294) ;Masm
ChebyChf40 dd -0.16666657096504701458241294,-0.16666657096504701458241294,-0.16666657096504701458241294,-0.16666657096504701458241294 ; fasm You are right it's too much precision for 32 bit ( Masm accepts it... ) I use the same constant-values for 32, 64, and 80 bit. |
|||
11 Sep 2015, 13:35 |
|
RIxRIpt 12 Sep 2015, 11:33
revolution wrote: RIxRIpt: 1. Single precision only, as you can see cvtss2sd xmm0, xmm0 in the bottom of the repeat block. 2. I took this code some time ago from intel's approx. math lib, cannot find it any more, so I'm not sure about accuracy. I guess it's about 10e-4 3. Any Roman, intel i7-2600 @ 3.40GHz I wanted to say that it's about 5 times faster than fsin, not yours fastsin version. _________________ Привет =3 Admins, please activate my account "RIscRIpt" |
|||
12 Sep 2015, 11:33 |
|
revolution 12 Sep 2015, 12:02
I question those that say the input range is "32-bit float" or "any". What is the sine of infinity? What is the sine of NaN? What is the sine of 2^65? What about handling of denormals? Does it take degrees or radians (or some other format)? How does the accuracy change for input values close to PI, or zero?
I guess some might read those questions as pedantic, but for these types of functions we should know the limitations and tradeoffs we are making. We can't simply compare based upon number-of-clock-cycles-to-run because the input and output might be of insufficient range, precision or accuracy to suit our needs. |
|||
12 Sep 2015, 12:02 |
|
pabloreda 12 Sep 2015, 19:40
for the record
I use 16.16 fixed point. for speed, use 1.0 for represent 1 turn/360 degree. with this avoid division and modulus. The main rutine is not mine, I copy from code (I guess from Chuck Moore) and I convert from radian to 1.0 (the first multiplication) work ok with 3d graphics. I guess can avoid first multiplication recalculating the other constant but I don't know how. this is the rutine compiled: parameter in in eax, output in eax Code: w3: ; --- sinp --- ;dD:0 dU:-1 dR:-1 (4) and eax,$7FFF sub eax,$3FFF mov ebx,$64879 cdq imul ebx shrd eax,edx,$10 xchg eax,eax cdq imul eax shrd eax,edx,$10 mov ebx,$1F2 mov ecx,eax cdq imul ebx shrd eax,edx,$10 sub eax,$2A82 cdq imul ecx shrd eax,edx,$10 add eax,$10000 cdq imul ecx shrd eax,edx,$10 ret w4: ; --- cos --- ;dD:0 dU:-1 dR:-1 (2) add eax,$7FFF test eax,$8000 jz w3 call w3 ; sinp neg eax ret w5: ; --- sin --- ;dD:0 dU:-1 dR:-1 (2) add eax,$3FFF test eax,$8000 jz w3 call w3 ; sinp neg eax ret The source code in FORTH like languaje, :R4 The number with . are converted to fixed point. Code: :sinp $7fff and $3fff - 6.2831 *. | radianes->bangle dup dup *. | f sqr dup 498 *. | f sqr resul 10882 - *. 1.0 + *. ; ::cos | v -- r $7fff + $8000 nand? ( sinp ; ) sinp neg ; ::sin | v -- r $3fff + $8000 nand? ( sinp ; ) sinp neg ; |
|||
12 Sep 2015, 19:40 |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.