flat assembler
Message board for the users of flat assembler.

flat assembler > Examples and Tutorials > Fast sinus SSE.

Author
Thread Post new topic Reply to topic
Roman



Joined: 21 Apr 2012
Posts: 358
4 times faster than FPU fsin !!! Test Intel I-5 2320 3.0 GHz
Degree to radian. 90*0.0174532925 = 1.570796325
Code:
mov eax,1.570796325 ;radian = 90 degree movd xmm0,eax movss xmm1,[inv_pi] movss xmm3,[bias] mulss xmm1,xmm0 addss xmm1,xmm3 movss xmm2,xmm1 subss xmm1,xmm3 movss xmm4,[pi1] movss xmm3,[pi0] mulss xmm3,xmm1 pslld xmm2,31 ;nado li $31 subss xmm0,xmm3 mulss xmm4,xmm1 movss xmm6,[p0] subss xmm0,xmm4 movss xmm5,[pi2] mulss xmm5,xmm1 subss xmm0,xmm5 mulss xmm1,[pi3] subss xmm0,xmm1 movss xmm7,xmm0 mulss xmm7,xmm7 mulss xmm6,xmm7 pxor xmm0,xmm2 addss xmm6,[p1] mulss xmm6,xmm7 addss xmm6,[p2] mulss xmm6,xmm7 addss xmm6,[p3] mulss xmm6,xmm7 mulss xmm6,xmm0 addss xmm0,xmm6 movss [bits],xmm0 ;in bits result

And data:
Code:
bias dd 12582912.0 pi0 dd 3.140625 pi1 dd 9.670257568359375e-4 pi2 dd 6.2771141529083251953125e-7 pi3 dd 1.2154201013012385202950e-10 inv_pi dd 0.318309886 p0 dd 2.601429e-6 p1 dd -1.980698e-4 p2 dd +8.333017e-3 p3 dd -1.666665e-1 bits dd 0


Last edited by Roman on 10 Sep 2015, 06:26; edited 1 time in total
Post 10 Sep 2015, 05:18
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15872
Location: 162173 Ryugu
Is it only single precision?

What is the accuracy?

What is the input range?
Post 10 Sep 2015, 05:55
View user's profile Send private message Visit poster's website Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 358
Input range float 32 bit.
Output float 32 bit value.
Accuracy. Seven signs after point.


Last edited by Roman on 10 Sep 2015, 06:28; edited 1 time in total
Post 10 Sep 2015, 06:17
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 358
Windows calculator sin 10 degree = 0,17364817766693034885171662676931
FastSin sin 10 degree = 0,1736481786
Post 10 Sep 2015, 06:22
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 358
FastSin can be useful in software render, or sound synthesis.
Post 10 Sep 2015, 06:25
View user's profile Send private message Reply with quote
RIxRIpt



Joined: 18 Apr 2013
Posts: 49
Roman, maybe you can be interested in this asm listing / snippet:
5 times faster on my cpu
Code:
format PE CONSOLE include 'win32ax.inc' entry main pow = 20 section '.code' code readable executable proc main mov esi, 1 shl pow .loop: rdtsc mov dword[time], eax mov dword[time + 4], edx repeat 50 movss xmm0, [src] ;unedited movss [esp - 4], xmm0 movss xmm1, [_ps_am_inv_sign_mask] mov eax, [esp - 4] mulss xmm0, [_ps_am_2_o_pi] andps xmm0, xmm1 and eax, 0x80000000 cvttss2si ecx, xmm0 movss xmm1, [_ps_am_1] mov edx, ecx shl edx, (31 - 1) cvtsi2ss xmm2, ecx and ecx, 0x1 and edx, 0x80000000 subss xmm0, xmm2 movss xmm6, [_sincos_masks + ecx * 4] minss xmm0, xmm1 movss xmm5, [_ps_sincos_p3] subss xmm1, xmm0 andps xmm1, xmm6 andnps xmm6, xmm0 orps xmm1, xmm6 movss xmm4, [_ps_sincos_p2] movss xmm0, xmm1 mulss xmm1, xmm1 movss xmm7, [_ps_sincos_p1] xor eax, edx movss xmm2, xmm1 mulss xmm1, xmm5 movss xmm5, [_ps_sincos_p0] mov [esp - 4], eax addss xmm1, xmm4 mulss xmm1, xmm2 movss xmm3, [esp - 4] addss xmm1, xmm7 mulss xmm1, xmm2 orps xmm0, xmm3 addss xmm1, xmm5 mulss xmm0, xmm1 cvtss2sd xmm0, xmm0 movlpd [dest], xmm0 end repeat rdtsc sub eax, dword[time] sbb edx, dword[time + 4] add dword[total_time], eax adc dword[total_time + 4], edx dec esi jnz .loop fild [total_time] fidiv [divv] fstp [total_time] cinvoke printf, fmt1, double[total_time], double[dest] xorps xmm0, xmm0 movlpd [total_time], xmm0 finit mov esi, 1 shl pow .loop2: rdtsc mov dword[time], eax mov dword[time + 4], edx repeat 50 fld [src] fsin fstp [dest] end repeat rdtsc sub eax, dword[time] sbb edx, dword[time + 4] add dword[total_time], eax adc dword[total_time + 4], edx dec esi jnz .loop2 fild [total_time] fidiv [divv] fstp [total_time] cinvoke printf, fmt2, double[total_time], double[dest] .below: cinvoke getch ret endp section '.data' data readable writeable time dq 0 total_time dq 0 divv dd 1 shl pow fmt1 db 'SSE2 %lf sin(45 degree): %lf', 10, 0 fmt2 db 'FPU %lf sin(45 degree): %lf', 10, 0 src dd 0.78539816339744830961566084581988 dest dq ? align 4 _ps_am_inv_sign_mask dd 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF _ps_am_pi_o_2 dd 1.57079632679, 1.57079632679, 1.57079632679, 1.57079632679 _ps_am_2_o_pi dd 0.63661977236, 0.63661977236, 0.63661977236, 0.63661977236 _epi32_1 dd 1, 1, 1, 1 _ps_am_1 dd 1.0, 1.0, 1.0, 1.0 _epi32_2 dd 2, 2, 2, 2 _ps_sincos_p3 dd -0.00468175413, -0.00468175413, -0.00468175413, -0.00468175413 _ps_sincos_p2 dd 0.0796926262, 0.0796926262, 0.0796926262, 0.0796926262 _ps_sincos_p1 dd -0.64596409750621,-0.64596409750621,-0.64596409750621,-0.64596409750621 _ps_sincos_p0 dd 1.570796326794896, 1.570796326794896, 1.570796326794896, 1.570796326794896 _sincos_masks dd 0x0, not 0x0 section '.idata' import data readable writeable library msvcrt, 'msvcrt.dll' import msvcrt,\ printf, 'printf',\ getch, '_getch'

src
Post 10 Sep 2015, 19:50
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15872
Location: 162173 Ryugu
RIxRIpt:

Is it only double precision?

What is the accuracy?

What is the input range?
Post 11 Sep 2015, 01:22
View user's profile Send private message Visit poster's website Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 358
RIxRIpt
Tell your model CPU.
Your version fast sin can not be faster.
Because the principle is similar, and the same around number CPU of commands.
Your 41 commands. And using slow SSE commands andps(2 times), andnps, minss.
My 33 commands. Using two slow SSE commands pslld and pxor
Post 11 Sep 2015, 06:53
View user's profile Send private message Reply with quote
Siekmanski



Joined: 18 Jun 2006
Posts: 12
Location: The Netherlands
Another one, it's in masm notation.....

On my PC ( Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz) this single float routine ( calcs 4 at once ) is 196 times faster than fsin.


Code:
.data align 16 OneDivPi real4 4 dup (0.31830988618379067153776752674) Pi real4 4 dup (3.14159265358979323846264338327) ChebyChf40 real4 4 dup (-0.16666657096504701458241294000) ChebyChf41 real4 4 dup (0.00833301729156221812798629161) ChebyChf42 real4 4 dup (-0.00019806615201350805044116296) ChebyChf43 real4 4 dup (0.00000260005476789036127712325) .code align 16 SSE2sinSPf proc ; Calculate Sine of 1,2,3 or 4 packed single floating point values ( in: xmm0, result: xmm0 ) mulps xmm0,oword ptr OneDivPi ; 1/pi to get a 1pi range cvtps2dq xmm3,xmm0 ; (4 packed spfp to 4 packed int32) lose the fractional parts and keep it in xmm3 to save the signs cvtdq2ps xmm1,xmm3 ; (4 packed int32 to 4 packed spfp) save the integral parts subps xmm0,xmm1 ; now it's inside the range, results are values between -0.5 to 0.4999999 pslld xmm3,31 ; put sign-bits in position, to place values in the right hemispheres xorps xmm0,xmm3 ; set sign-bits mulps xmm0,oword ptr Pi ; restore ranges between -1/2 pi to +1/2 pi ; Now do the Chebyshev approximation of a 9th degree polynomial ; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi movaps xmm2,xmm0 mulps xmm2,xmm2 movaps xmm1,oword ptr ChebyChf43 mulps xmm1,xmm2 addps xmm1,oword ptr ChebyChf42 mulps xmm1,xmm2 addps xmm1,oword ptr ChebyChf41 mulps xmm1,xmm2 addps xmm1,oword ptr ChebyChf40 mulps xmm2,xmm0 mulps xmm1,xmm2 addps xmm0,xmm1 ret SSE2sinSPf endp


Last edited by Siekmanski on 11 Sep 2015, 10:22; edited 1 time in total
Post 11 Sep 2015, 08:39
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15872
Location: 162173 Ryugu
Siekmanski:

Is it only single precision?

What is the accuracy?

What is the input range?
Post 11 Sep 2015, 08:50
View user's profile Send private message Visit poster's website Reply with quote
Siekmanski



Joined: 18 Jun 2006
Posts: 12
Location: The Netherlands
Yes, this one calculates the sine of 4 single floating point values at once. (same code can be used to calculate also 1 or 2 or 3 values at once)
9 digits precision, enough for some purposes and it's fast.

The code keeps the input value range between -1/2 and +1/2 PI, so the input can be large but the larger the input value the more precision you'll loose.
Just try it out.

A double precision example,
It calculates the sine of 2 double floating point values at once:


Code:
.data align 16 OneDivPiDP real8 2 dup (0.31830988618379067153776752674) PiDP real8 2 dup (3.14159265358979323846264338327) ChebyChf0 real8 2 dup (-0.16666657096504701458241294000) ChebyChf1 real8 2 dup (0.00833301729156221812798629161) ChebyChf2 real8 2 dup (-0.00019806615201350805044116296) ChebyChf3 real8 2 dup (0.00000260005476789036127712325) .code align 16 SSE2sinDPf proc ; Calculate Sine of 1 double or 2 packed double floating point values ( in: xmm0, result: xmm0 ) mulpd xmm0,oword ptr OneDivPiDP ; 1/pi to get a 1pi range cvtpd2dq xmm3,xmm0 ; (2 packed dpfp to 2 int32) lose the fractional parts and keep it in xmm3 to save the signs cvtdq2pd xmm1,xmm3 ; (2 packed int32 to 2 dpfp) save the integral parts subpd xmm0,xmm1 ; now it's inside the range, results are values between -0.5 to 0.4999999 pslld xmm3,31 ; put sign-bits in position, to place values in the right hemispheres pshufd xmm3,xmm3,Shuffle(1,3,0,2) ; shuffle the sign-bits into place xorpd xmm0,xmm3 ; set sign-bits mulpd xmm0,oword ptr PiDP ; restore ranges between -1/2 pi to +1/2 pi ; Now do the Chebyshev approximation of a 9th degree polynomial ; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi movapd xmm2,xmm0 mulpd xmm2,xmm2 movapd xmm1,oword ptr ChebyChf3 mulpd xmm1,xmm2 addpd xmm1,oword ptr ChebyChf2 mulpd xmm1,xmm2 addpd xmm1,oword ptr ChebyChf1 mulpd xmm1,xmm2 addpd xmm1,oword ptr ChebyChf0 mulpd xmm2,xmm0 mulpd xmm1,xmm2 addpd xmm0,xmm1 ret SSE2sinDPf endp


Last edited by Siekmanski on 11 Sep 2015, 12:25; edited 1 time in total
Post 11 Sep 2015, 09:38
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 358
Siekmanski
Great example !!!
You cool !!! Smile

But your example not faster than my example.
I test on my CPU and not get great result.
My example i run 1000000 and got 10
And i run your example 1000000 and got 10~25 !!!
Code:
invoke timeBeginPeriod,1 invoke timeGetTime mov dword [chVl],eax do cycle 1000000 Call FastSinus invoke timeGetTime mov ebx,dword [chVl] sub eax,ebx

In eax = all time
Post 11 Sep 2015, 11:50
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 358
Siekmanski
O ! Sorry
I understend. Your example at once calculating four sinus(4 float32) !!!
Nice ! Very nice !!!

If i right real4 for Fasm DD.
DD 4 dup (0.31830988618379067153776752674)
Why need so long number for 32 bit number ?
Post 11 Sep 2015, 12:13
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 358
Code:
movaps xmm0,dqword [FourRadians] mulps xmm0,dqword [OneDivPi] ; 1/pi to get a 1pi range cvtps2dq xmm3,xmm0 ; (4 packed spfp to 4 packed int32) lose the fractional parts and keep it in xmm3 to save the signs cvtdq2ps xmm1,xmm3 ; (4 packed int32 to 4 packed spfp) save the integral parts subps xmm0,xmm1 ; now it's inside the range, results are values between -0.5 to 0.4999999 pslld xmm3,31 ; put sign-bits in position, to place values in the right hemispheres xorps xmm0,xmm3 ; set sign-bits mulps xmm0,dqword [Pi] ; restore ranges between -1/2 pi to +1/2 pi ; Now do the Chebyshev approximation of a 9th degree polynomial ; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi movaps xmm2,xmm0 mulps xmm2,xmm2 movaps xmm1,dqword [ChebyChf43] mulps xmm1,xmm2 addps xmm1,dqword [ChebyChf42] mulps xmm1,xmm2 addps xmm1,dqword [ChebyChf41] mulps xmm1,xmm2 addps xmm1,dqword [ChebyChf40] mulps xmm2,xmm0 mulps xmm1,xmm2 addps xmm0,xmm1 movaps dqword [Sin1234],xmm0 align 16 OneDivPi dd 4 dup (0.3183098861837) Pi dd 4 dup (3.1415926535897) ChebyChf40 dd 4 dup (-0.1666665709650) ChebyChf41 dd 4 dup (0.0083330172915) ChebyChf42 dd 4 dup (-0.0001980661520) ChebyChf43 dd 4 dup (0.0000026000547) FourRadians dd 0.174532925,1.570796325,0.0174532925,1.570796325 ;four numbers radian Sin1234 dd 0,0,0,0 ;result four sinus
Post 11 Sep 2015, 13:23
View user's profile Send private message Reply with quote
Siekmanski



Joined: 18 Jun 2006
Posts: 12
Location: The Netherlands
ChebyChf40 real4 4 dup (-0.16666657096504701458241294) ;Masm
ChebyChf40 dd -0.16666657096504701458241294,-0.16666657096504701458241294,-0.16666657096504701458241294,-0.16666657096504701458241294 ; fasm

You are right it's too much precision for 32 bit ( Masm accepts it... ) I use the same constant-values for 32, 64, and 80 bit.
Post 11 Sep 2015, 13:35
View user's profile Send private message Reply with quote
RIxRIpt



Joined: 18 Apr 2013
Posts: 49
revolution wrote:
RIxRIpt:

Is it only double precision?

What is the accuracy?

What is the input range?

1. Single precision only, as you can see cvtss2sd xmm0, xmm0 in the bottom of the repeat block.
2. I took this code some time ago from intel's approx. math lib, cannot find it any more, so I'm not sure about accuracy. I guess it's about 10e-4
3. Any

Roman,
intel i7-2600 @ 3.40GHz
I wanted to say that it's about 5 times faster than fsin, not yours fastsin version.

_________________
Привет =3
Admins, please activate my account "RIscRIpt"
Post 12 Sep 2015, 11:33
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15872
Location: 162173 Ryugu
I question those that say the input range is "32-bit float" or "any". What is the sine of infinity? What is the sine of NaN? What is the sine of 2^65? What about handling of denormals? Does it take degrees or radians (or some other format)? How does the accuracy change for input values close to PI, or zero?

I guess some might read those questions as pedantic, but for these types of functions we should know the limitations and tradeoffs we are making. We can't simply compare based upon number-of-clock-cycles-to-run because the input and output might be of insufficient range, precision or accuracy to suit our needs.
Post 12 Sep 2015, 12:02
View user's profile Send private message Visit poster's website Reply with quote
pabloreda



Joined: 24 Jan 2007
Posts: 84
Location: Argentina
for the record

I use 16.16 fixed point.
for speed, use 1.0 for represent 1 turn/360 degree. with this avoid division and modulus.
The main rutine is not mine, I copy from code (I guess from Chuck Moore) and I convert from radian to 1.0 (the first multiplication)
work ok with 3d graphics.
I guess can avoid first multiplication recalculating the other constant but I don't know how.

this is the rutine compiled:
parameter in in eax, output in eax

Code:
w3: ; --- sinp --- ;dD:0 dU:-1 dR:-1 (4) and eax,$7FFF sub eax,$3FFF mov ebx,$64879 cdq imul ebx shrd eax,edx,$10 xchg eax,eax cdq imul eax shrd eax,edx,$10 mov ebx,$1F2 mov ecx,eax cdq imul ebx shrd eax,edx,$10 sub eax,$2A82 cdq imul ecx shrd eax,edx,$10 add eax,$10000 cdq imul ecx shrd eax,edx,$10 ret w4: ; --- cos --- ;dD:0 dU:-1 dR:-1 (2) add eax,$7FFF test eax,$8000 jz w3 call w3 ; sinp neg eax ret w5: ; --- sin --- ;dD:0 dU:-1 dR:-1 (2) add eax,$3FFF test eax,$8000 jz w3 call w3 ; sinp neg eax ret


The source code in FORTH like languaje, :R4
The number with . are converted to fixed point.

Code:
:sinp $7fff and $3fff - 6.2831 *. | radianes->bangle dup dup *. | f sqr dup 498 *. | f sqr resul 10882 - *. 1.0 + *. ; ::cos | v -- r $7fff + $8000 nand? ( sinp ; ) sinp neg ; ::sin | v -- r $3fff + $8000 nand? ( sinp ; ) sinp neg ;
Post 12 Sep 2015, 19:40
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >

Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 2004-2018, Tomasz Grysztar.

Powered by rwasa.