flat assembler
Message board for the users of flat assembler.

Index > Tutorials and Examples > Fast sinus SSE.

Author
Thread Post new topic Reply to topic
Roman



Joined: 21 Apr 2012
Posts: 1848
Roman 10 Sep 2015, 05:18
4 times faster than FPU fsin !!! Test Intel I-5 2320 3.0 GHz
Degree to radian. 90*0.0174532925 = 1.570796325
Code:
        mov     eax,1.570796325  ;radian = 90 degree
        movd       xmm0,eax
        movss      xmm1,[inv_pi]
        movss      xmm3,[bias]    
        mulss       xmm1,xmm0
        addss       xmm1,xmm3
        movss      xmm2,xmm1
        subss       xmm1,xmm3
        movss      xmm4,[pi1]
        movss      xmm3,[pi0]
        mulss      xmm3,xmm1
        pslld        xmm2,31     ;nado li $31
        subss      xmm0,xmm3
        mulss      xmm4,xmm1
        movss      xmm6,[p0]
        subss      xmm0,xmm4
        movss      xmm5,[pi2]
        mulss      xmm5,xmm1
        subss      xmm0,xmm5
        mulss      xmm1,[pi3]
        subss      xmm0,xmm1
        movss      xmm7,xmm0
        mulss      xmm7,xmm7
        mulss      xmm6,xmm7
        pxor       xmm0,xmm2
        addss      xmm6,[p1]
        mulss      xmm6,xmm7
        addss      xmm6,[p2]
        mulss      xmm6,xmm7
        addss      xmm6,[p3]
        mulss      xmm6,xmm7
        mulss      xmm6,xmm0
        addss      xmm0,xmm6
        movss      [bits],xmm0   ;in bits result
    

And data:
Code:
        bias    dd 12582912.0
        pi0     dd 3.140625
        pi1     dd 9.670257568359375e-4
        pi2     dd 6.2771141529083251953125e-7
        pi3     dd 1.2154201013012385202950e-10
        inv_pi  dd 0.318309886
        p0      dd 2.601429e-6
        p1      dd -1.980698e-4
        p2      dd +8.333017e-3
        p3      dd -1.666665e-1
        bits    dd 0            
    


Last edited by Roman on 10 Sep 2015, 06:26; edited 1 time in total
Post 10 Sep 2015, 05:18
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 10 Sep 2015, 05:55
Is it only single precision?

What is the accuracy?

What is the input range?
Post 10 Sep 2015, 05:55
View user's profile Send private message Visit poster's website Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1848
Roman 10 Sep 2015, 06:17
Input range float 32 bit.
Output float 32 bit value.
Accuracy. Seven signs after point.


Last edited by Roman on 10 Sep 2015, 06:28; edited 1 time in total
Post 10 Sep 2015, 06:17
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1848
Roman 10 Sep 2015, 06:22
Windows calculator sin 10 degree = 0,17364817766693034885171662676931
FastSin sin 10 degree = 0,1736481786
Post 10 Sep 2015, 06:22
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1848
Roman 10 Sep 2015, 06:25
FastSin can be useful in software render, or sound synthesis.
Post 10 Sep 2015, 06:25
View user's profile Send private message Reply with quote
RIxRIpt



Joined: 18 Apr 2013
Posts: 50
RIxRIpt 10 Sep 2015, 19:50
Roman, maybe you can be interested in this asm listing / snippet:
5 times faster on my cpu
Code:
format PE CONSOLE

include 'win32ax.inc'

entry main

pow = 20

section '.code' code readable executable
        proc main
                mov esi, 1 shl pow
                .loop:
                        rdtsc
                        mov dword[time], eax
                        mov dword[time + 4], edx
                        
                        repeat 50
                                movss xmm0, [src]
                                ;unedited
                                movss   [esp - 4], xmm0
                                movss   xmm1, [_ps_am_inv_sign_mask]
                                mov             eax, [esp - 4]
                                mulss   xmm0, [_ps_am_2_o_pi]
                                andps   xmm0, xmm1
                                and             eax, 0x80000000
                                cvttss2si       ecx, xmm0
                                movss   xmm1, [_ps_am_1]
                                mov             edx, ecx
                                shl             edx, (31 - 1)
                                cvtsi2ss        xmm2, ecx
                                and             ecx, 0x1
                                and             edx, 0x80000000
                                subss   xmm0, xmm2
                                movss   xmm6, [_sincos_masks + ecx * 4]
                                minss   xmm0, xmm1
                                movss   xmm5, [_ps_sincos_p3]
                                subss   xmm1, xmm0
                                andps   xmm1, xmm6
                                andnps  xmm6, xmm0
                                orps    xmm1, xmm6
                                movss   xmm4, [_ps_sincos_p2]
                                movss   xmm0, xmm1
                                mulss   xmm1, xmm1
                                movss   xmm7, [_ps_sincos_p1]
                                xor             eax, edx
                                movss   xmm2, xmm1
                                mulss   xmm1, xmm5
                                movss   xmm5, [_ps_sincos_p0]
                                mov             [esp - 4], eax
                                addss   xmm1, xmm4
                                mulss   xmm1, xmm2
                                movss   xmm3, [esp - 4]
                                addss   xmm1, xmm7
                                mulss   xmm1, xmm2
                                orps    xmm0, xmm3
                                addss   xmm1, xmm5
                                mulss   xmm0, xmm1
                                
                                cvtss2sd xmm0, xmm0
                                movlpd [dest], xmm0
                        end repeat
                        
                        rdtsc
                        sub eax, dword[time]
                        sbb edx, dword[time + 4]
                        
                        add dword[total_time], eax
                        adc dword[total_time + 4], edx
                        dec esi
                jnz .loop
                                
                fild [total_time]
                fidiv [divv]
                fstp [total_time]
                
                cinvoke printf, fmt1, double[total_time], double[dest]
                
                xorps xmm0, xmm0
                movlpd [total_time], xmm0
                
                finit
                
                mov esi, 1 shl pow
                .loop2:
                        rdtsc
                        mov dword[time], eax
                        mov dword[time + 4], edx
                        
                        repeat 50
                                fld [src]
                                fsin
                                fstp [dest]
                        end repeat
                        
                        rdtsc
                        sub eax, dword[time]
                        sbb edx, dword[time + 4]
                        
                        add dword[total_time], eax
                        adc dword[total_time + 4], edx
                        dec esi
                jnz .loop2
                                
                fild [total_time]
                fidiv [divv]
                fstp [total_time]
                
                cinvoke printf, fmt2, double[total_time], double[dest]
                
                .below:
                cinvoke getch
                ret
        endp

section '.data' data readable writeable
        time dq 0
        total_time dq 0
        divv dd 1 shl pow
        fmt1 db 'SSE2 %lf sin(45 degree): %lf', 10, 0
        fmt2 db 'FPU  %lf sin(45 degree): %lf', 10, 0
        
        src dd 0.78539816339744830961566084581988
        dest dq ?
        
        align 4
        
    _ps_am_inv_sign_mask dd 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF
    _ps_am_pi_o_2 dd 1.57079632679, 1.57079632679, 1.57079632679, 1.57079632679
    _ps_am_2_o_pi dd 0.63661977236, 0.63661977236, 0.63661977236, 0.63661977236
    _epi32_1 dd 1, 1, 1, 1
    _ps_am_1 dd 1.0, 1.0, 1.0, 1.0
    _epi32_2 dd 2, 2, 2, 2

    _ps_sincos_p3 dd -0.00468175413, -0.00468175413, -0.00468175413, -0.00468175413
    _ps_sincos_p2 dd 0.0796926262, 0.0796926262, 0.0796926262, 0.0796926262
    _ps_sincos_p1 dd -0.64596409750621,-0.64596409750621,-0.64596409750621,-0.64596409750621 
    _ps_sincos_p0 dd 1.570796326794896, 1.570796326794896, 1.570796326794896, 1.570796326794896

        _sincos_masks dd 0x0, not 0x0
        
section '.idata' import data readable writeable
        library msvcrt, 'msvcrt.dll'
        import msvcrt,\
                printf, 'printf',\
                getch, '_getch'


    

src
Post 10 Sep 2015, 19:50
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 11 Sep 2015, 01:22
RIxRIpt:

Is it only double precision?

What is the accuracy?

What is the input range?
Post 11 Sep 2015, 01:22
View user's profile Send private message Visit poster's website Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1848
Roman 11 Sep 2015, 06:53
RIxRIpt
Tell your model CPU.
Your version fast sin can not be faster.
Because the principle is similar, and the same around number CPU of commands.
Your 41 commands. And using slow SSE commands andps(2 times), andnps, minss.
My 33 commands. Using two slow SSE commands pslld and pxor
Post 11 Sep 2015, 06:53
View user's profile Send private message Reply with quote
Siekmanski



Joined: 18 Jun 2006
Posts: 12
Location: The Netherlands
Siekmanski 11 Sep 2015, 08:39
Another one, it's in masm notation.....

On my PC ( Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz) this single float routine ( calcs 4 at once ) is 196 times faster than fsin.


Code:
.data
align 16

OneDivPi    real4 4 dup (0.31830988618379067153776752674)
Pi          real4 4 dup (3.14159265358979323846264338327)
ChebyChf40  real4 4 dup (-0.16666657096504701458241294000)
ChebyChf41  real4 4 dup (0.00833301729156221812798629161)
ChebyChf42  real4 4 dup (-0.00019806615201350805044116296)
ChebyChf43  real4 4 dup (0.00000260005476789036127712325)

.code

align 16
SSE2sinSPf proc                             ; Calculate Sine of 1,2,3 or 4 packed single floating point values ( in: xmm0, result: xmm0 )

    mulps       xmm0,oword ptr OneDivPi     ; 1/pi to get a 1pi range
    cvtps2dq    xmm3,xmm0                   ; (4 packed spfp to 4 packed int32) lose the fractional parts and keep it in xmm3 to save the signs
    cvtdq2ps    xmm1,xmm3                   ; (4 packed int32 to 4 packed spfp) save the  integral parts
    subps       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 to 0.4999999
    pslld       xmm3,31                     ; put sign-bits in position, to place values in the right hemispheres
    xorps       xmm0,xmm3                   ; set sign-bits
    mulps       xmm0,oword ptr Pi           ; restore ranges between -1/2 pi to +1/2 pi

; Now do the Chebyshev approximation of a 9th degree polynomial
; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi

    movaps      xmm2,xmm0
    mulps       xmm2,xmm2

    movaps      xmm1,oword ptr ChebyChf43
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr ChebyChf42
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr ChebyChf41
    mulps       xmm1,xmm2
    addps       xmm1,oword ptr ChebyChf40
    mulps       xmm2,xmm0
    mulps       xmm1,xmm2
    addps       xmm0,xmm1
    ret

SSE2sinSPf endp
    


Last edited by Siekmanski on 11 Sep 2015, 10:22; edited 1 time in total
Post 11 Sep 2015, 08:39
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 11 Sep 2015, 08:50
Siekmanski:

Is it only single precision?

What is the accuracy?

What is the input range?
Post 11 Sep 2015, 08:50
View user's profile Send private message Visit poster's website Reply with quote
Siekmanski



Joined: 18 Jun 2006
Posts: 12
Location: The Netherlands
Siekmanski 11 Sep 2015, 09:38
Yes, this one calculates the sine of 4 single floating point values at once. (same code can be used to calculate also 1 or 2 or 3 values at once)
9 digits precision, enough for some purposes and it's fast.

The code keeps the input value range between -1/2 and +1/2 PI, so the input can be large but the larger the input value the more precision you'll loose.
Just try it out.

A double precision example,
It calculates the sine of 2 double floating point values at once:


Code:
.data

align 16
OneDivPiDP  real8 2 dup (0.31830988618379067153776752674)
PiDP        real8 2 dup (3.14159265358979323846264338327)

ChebyChf0   real8 2 dup (-0.16666657096504701458241294000)
ChebyChf1   real8 2 dup (0.00833301729156221812798629161)
ChebyChf2   real8 2 dup (-0.00019806615201350805044116296)
ChebyChf3   real8 2 dup (0.00000260005476789036127712325)

.code

align 16
SSE2sinDPf proc                             ; Calculate Sine of 1 double or 2 packed double floating point values ( in: xmm0, result: xmm0 )
    
    mulpd       xmm0,oword ptr OneDivPiDP   ; 1/pi to get a 1pi range
    cvtpd2dq    xmm3,xmm0                   ; (2 packed dpfp to 2 int32) lose the fractional parts and keep it in xmm3 to save the signs
    cvtdq2pd    xmm1,xmm3                   ; (2 packed int32 to 2 dpfp) save the  integral parts
    subpd       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 to 0.4999999
    pslld       xmm3,31                     ; put sign-bits in position, to place values in the right hemispheres
    pshufd      xmm3,xmm3,Shuffle(1,3,0,2)  ; shuffle the sign-bits into place         
    xorpd       xmm0,xmm3                   ; set sign-bits
    mulpd       xmm0,oword ptr PiDP         ; restore ranges between -1/2 pi to +1/2 pi

; Now do the Chebyshev approximation of a 9th degree polynomial
; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi

    movapd      xmm2,xmm0
    mulpd       xmm2,xmm2

    movapd      xmm1,oword ptr ChebyChf3
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyChf2
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyChf1
    mulpd       xmm1,xmm2
    addpd       xmm1,oword ptr ChebyChf0
    mulpd       xmm2,xmm0
    mulpd       xmm1,xmm2
    addpd       xmm0,xmm1
    ret
SSE2sinDPf endp
    


Last edited by Siekmanski on 11 Sep 2015, 12:25; edited 1 time in total
Post 11 Sep 2015, 09:38
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1848
Roman 11 Sep 2015, 11:50
Siekmanski
Great example !!!
You cool !!! Smile

But your example not faster than my example.
I test on my CPU and not get great result.
My example i run 1000000 and got 10
And i run your example 1000000 and got 10~25 !!!
Code:
        invoke    timeBeginPeriod,1
        invoke  timeGetTime
        mov     dword [chVl],eax 
do cycle 1000000 Call FastSinus
                               invoke  timeGetTime
                               mov     ebx,dword [chVl]
                               sub     eax,ebx  
    

In eax = all time
Post 11 Sep 2015, 11:50
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1848
Roman 11 Sep 2015, 12:13
Siekmanski
O ! Sorry
I understend. Your example at once calculating four sinus(4 float32) !!!
Nice ! Very nice !!!

If i right real4 for Fasm DD.
DD 4 dup (0.31830988618379067153776752674)
Why need so long number for 32 bit number ?
Post 11 Sep 2015, 12:13
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1848
Roman 11 Sep 2015, 13:23
Code:
movaps      xmm0,dqword [FourRadians]
    mulps       xmm0,dqword [OneDivPi]      ; 1/pi to get a 1pi range
    cvtps2dq    xmm3,xmm0                   ; (4 packed spfp to 4 packed int32) lose the fractional parts and keep it in xmm3 to save the signs 
    cvtdq2ps    xmm1,xmm3                   ; (4 packed int32 to 4 packed spfp) save the  integral parts 
    subps       xmm0,xmm1                   ; now it's inside the range, results are values between -0.5 to 0.4999999 
    pslld       xmm3,31                     ; put sign-bits in position, to place values in the right hemispheres 
    xorps       xmm0,xmm3                   ; set sign-bits 
    mulps       xmm0,dqword [Pi]            ; restore ranges between -1/2 pi to +1/2 pi

; Now do the Chebyshev approximation of a 9th degree polynomial 
; With 4 optimized constants for a maximum error of about 3.3381e-9 over -1/2 pi to +1/2 pi 

    movaps      xmm2,xmm0 
    mulps       xmm2,xmm2 

    movaps      xmm1,dqword [ChebyChf43]
    mulps       xmm1,xmm2 
    addps       xmm1,dqword [ChebyChf42]
    mulps       xmm1,xmm2 
    addps       xmm1,dqword [ChebyChf41]
    mulps       xmm1,xmm2 
    addps       xmm1,dqword [ChebyChf40]
    mulps       xmm2,xmm0 
    mulps       xmm1,xmm2 
    addps       xmm0,xmm1
    movaps      dqword [Sin1234],xmm0


     align 16
     OneDivPi    dd 4 dup (0.3183098861837)
     Pi          dd 4 dup (3.1415926535897)
     ChebyChf40  dd 4 dup (-0.1666665709650) 
     ChebyChf41  dd 4 dup (0.0083330172915) 
     ChebyChf42  dd 4 dup (-0.0001980661520)
     ChebyChf43  dd 4 dup (0.0000026000547) 
     FourRadians dd  0.174532925,1.570796325,0.0174532925,1.570796325 ;four numbers radian
     Sin1234 dd 0,0,0,0 ;result four sinus
    
Post 11 Sep 2015, 13:23
View user's profile Send private message Reply with quote
Siekmanski



Joined: 18 Jun 2006
Posts: 12
Location: The Netherlands
Siekmanski 11 Sep 2015, 13:35
ChebyChf40 real4 4 dup (-0.16666657096504701458241294) ;Masm
ChebyChf40 dd -0.16666657096504701458241294,-0.16666657096504701458241294,-0.16666657096504701458241294,-0.16666657096504701458241294 ; fasm

You are right it's too much precision for 32 bit ( Masm accepts it... ) I use the same constant-values for 32, 64, and 80 bit.
Post 11 Sep 2015, 13:35
View user's profile Send private message Reply with quote
RIxRIpt



Joined: 18 Apr 2013
Posts: 50
RIxRIpt 12 Sep 2015, 11:33
revolution wrote:
RIxRIpt:

Is it only double precision?

What is the accuracy?

What is the input range?

1. Single precision only, as you can see cvtss2sd xmm0, xmm0 in the bottom of the repeat block.
2. I took this code some time ago from intel's approx. math lib, cannot find it any more, so I'm not sure about accuracy. I guess it's about 10e-4
3. Any

Roman,
intel i7-2600 @ 3.40GHz
I wanted to say that it's about 5 times faster than fsin, not yours fastsin version.

_________________
Привет =3
Admins, please activate my account "RIscRIpt"
Post 12 Sep 2015, 11:33
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 12 Sep 2015, 12:02
I question those that say the input range is "32-bit float" or "any". What is the sine of infinity? What is the sine of NaN? What is the sine of 2^65? What about handling of denormals? Does it take degrees or radians (or some other format)? How does the accuracy change for input values close to PI, or zero?

I guess some might read those questions as pedantic, but for these types of functions we should know the limitations and tradeoffs we are making. We can't simply compare based upon number-of-clock-cycles-to-run because the input and output might be of insufficient range, precision or accuracy to suit our needs.
Post 12 Sep 2015, 12:02
View user's profile Send private message Visit poster's website Reply with quote
pabloreda



Joined: 24 Jan 2007
Posts: 116
Location: Argentina
pabloreda 12 Sep 2015, 19:40
for the record

I use 16.16 fixed point.
for speed, use 1.0 for represent 1 turn/360 degree. with this avoid division and modulus.
The main rutine is not mine, I copy from code (I guess from Chuck Moore) and I convert from radian to 1.0 (the first multiplication)
work ok with 3d graphics.
I guess can avoid first multiplication recalculating the other constant but I don't know how.

this is the rutine compiled:
parameter in in eax, output in eax

Code:
w3:  ; --- sinp --- ;dD:0 dU:-1 dR:-1 (4)
and eax,$7FFF
sub eax,$3FFF
mov ebx,$64879
cdq
imul ebx
shrd eax,edx,$10
xchg eax,eax
cdq
imul eax
shrd eax,edx,$10
mov ebx,$1F2
mov ecx,eax
cdq
imul ebx
shrd eax,edx,$10
sub eax,$2A82
cdq
imul ecx
shrd eax,edx,$10
add eax,$10000
cdq
imul ecx
shrd eax,edx,$10
ret

w4:  ; --- cos --- ;dD:0 dU:-1 dR:-1 (2)
add eax,$7FFF
test eax,$8000
jz w3
call w3 ; sinp
neg eax
ret

w5:  ; --- sin --- ;dD:0 dU:-1 dR:-1 (2)
add eax,$3FFF
test eax,$8000
jz w3
call w3 ; sinp
neg eax
ret
    


The source code in FORTH like languaje, :R4
The number with . are converted to fixed point.

Code:
:sinp
        $7fff and $3fff -
        6.2831 *. | radianes->bangle
        dup dup *. | f sqr
        dup 498 *. | f sqr resul
        10882 - *.
        1.0 +
        *. ;

::cos | v -- r
        $7fff + $8000 nand? ( sinp ; ) sinp neg ;
::sin | v -- r
        $3fff + $8000 nand? ( sinp ; ) sinp neg ;
    
Post 12 Sep 2015, 19:40
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.