flat assembler
Message board for the users of flat assembler.

Index > Main > Need input: sin function using FPU

Author
Thread Post new topic Reply to topic
mattst88



Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88 26 Jan 2007, 01:34
Code:
sin:
        fld     [x]     ; x
        fmul    st0,st0 ; x^2
        fld     [rf9]   ; 1/9! x^2
        fmul    st0,st1 ; x^2/9! x^2
        fsub    [rf7]   ; (-1/7! + x^2/9!) x^2
        fmul    st0,st1 ; (-x^2/7! + x^4/9!) x^2
        fadd    [rf5]   ; (1/5! - x^2/7! + x^4/9!) x^2
        fmul    st0,st1 ; (x^2/5! - x^4/7! + x^6/9!) x^2
        fsub    [rf3]   ; (-1/3! + x^2/5! - x^4/7! + x^6/9!) x^2
        fmulp   st0,st1 ; (-x^2/3! + x^4/5! - x^6/7! + x^8/9!) x^2
        fmul    [x]     ; (-x^3/3! + x^5/5! - x^7/7! + x^9/9!)
        fadd    [x]     ; (x - x^3/3! + x^5/5! - x^7/7! + x^9/9!)
        fstp    [x]     ;

rf9     dq      2.7557319223985890651862166557528e-6
rf7     dq      0.0001984126984126984126984126984127
rf5     dq      0.0083333333333333333333333333333333
rf3     dq      0.16666666666666666666666666666667

x       dq      ?    


I wrote this after figuring out Goplat's awesome cosine function he wrote in this thread.

I've never figured out how you people count the clocks experimentally so if someone would like to A) do it and tell me the results or B) show me how I'd be very appreciative.

Any errors? Input is more thank welcome.

Thanks guys Smile

_________________
My x86 Instruction Reference -- includes SSE, SSE2, SSE3, SSSE3, SSE4 instructions.
Assembly Programmer's Journal
Post 26 Jan 2007, 01:34
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 26 Jan 2007, 03:07
You can get an idea from http://www.agner.org/optimize/testp.zip and a recent thread http://board.flatassembler.net/topic.php?t=6563 (but much better first link Very Happy)
Post 26 Jan 2007, 03:07
View user's profile Send private message Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 26 Jan 2007, 05:38
You may view this old Fasm thread where I have posted an interesting algorithm, it's based of combining values of a LUT with some less further Taylor refinements. Also an examples of a very fast double and single precision cosine and sine function done with SSE (maybe not what you need) is included
see about in the middle of the page http://board.flatassembler.net/topic.php?t=3841

I especially mean this
Code:
cos(a) = cos( a-round(a)   +   round(a) ) 
  
using this trigonometric equation cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b) 
 
=> cos(a-round(a)) * costab[round(a)] - sin(a-round(a)) * sintab[round(a)] 
    
Post 26 Jan 2007, 05:38
View user's profile Send private message Reply with quote
Jack



Joined: 16 Feb 2005
Posts: 21
Jack 26 Jan 2007, 11:43
you should use a minimax approximation instead of taylor series, for example the following approximation is accurate to six digits in the interval -Pi/2 .. Pi/2
Code:
.99999661590855105827518776*x- .16664828382099226703400879*x^3+ .0083063252288911864431576062*x^5- .00018363654014537635070098734*x^7
    
Post 26 Jan 2007, 11:43
View user's profile Send private message Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 26 Jan 2007, 18:10
Jack wrote:
you should use a minimax approximation instead of taylor series, for example the following approximation is accurate to six digits in the interval -Pi/2 .. Pi/2
Code:
.99999661590855105827518776*x- .16664828382099226703400879*x^3+ .0083063252288911864431576062*x^5- .00018363654014537635070098734*x^7
    
ahh, you mou meant something like minimizing the maximum difference between the approximated function and the approximation polynom instead of juat Taylor polynom? Well, I have simply used the Taylor one cause it's easier to explain.

But you're right, those other polynoms are even more effective than the taylor ones cause you try to get the best approximation in an interval(like [0,2*pi]), and not only in 1 point like 0.

But best would be to combine those minimax polynom with the algorithm I have posted above.
I meane decomposing the sin/cosine into a bigger LUT part and into a smaller remainder that is passed to the minimax polynom which must then be set up to settle down very quickly in a different but smaller intervall, e.g. not [0,2*pi] but rather [0,d] with d very close to 0.
The resulting polynom will then be almost same as the taylor one, but only with very little change in parameters, but these small changes are what makes the difference!

Finally I got a SSE float example
Code:
CosSSE:
        movss   xmm0,[esi]
        mulss   xmm0,[Rad2LUT]
        cvtss2si        eax,xmm0
        cvtsi2ss        xmm1,eax
        and     eax,0FFh
        shl     eax,4
        subss   xmm0,xmm1
        mulss   xmm0,[LUT2Rad]
        movaps  xmm1,xmm0
        mulss   xmm0,xmm0
        movlhps xmm0,xmm0
        mulss   xmm0,xmm1
        movhps  xmm1,[_1.0]
        mulps   xmm0,[SinCosCoeff]
        addps   xmm0,xmm1
        mulps   xmm0,[SinCosLUT+eax]
        movhlps xmm1,xmm0
        subss   xmm1,xmm0
        movss   [edi],xmm1
        ret

align   4
Rad2LUT dd      40.743665  ;256/(2*pi)
LUT2Rad dd      0.024543692;2*pi/256
_1.0:
        dd      1.0,0

align   16
SinCosCoeff:
        dd      -0.16666230,0, -0.49997921,0;~ -1/3!, ~ -1/2!
SinCosLUT:
        rb      256*4*4;Have to be setup before first use
;The format of the LUT is as follows:
;[Single sine(0*(2pi/256))] [0] [Single cosine(0*(2pi/256))] [0]
; 0                          4   8                            0Ch
;[Single sine(1*(2pi/256))] [0] [Single cosine(1*(2pi/256))] [0]
; 10h                        14h 18h                          1Ch
;               . . .                   . . .
    

I know this math stuff isn't easy if you aren't into that, anyway, that task/algorithm could be clarified very well with a function-plot, but I'm too lazy for that now

p.s. this code is over a year old, so it's kinda unmainted, but it workeed at the time I wrote it

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||


Last edited by MCD on 28 Jan 2007, 13:16; edited 1 time in total
Post 26 Jan 2007, 18:10
View user's profile Send private message Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 28 Jan 2007, 04:23
LocoDelAssembly wrote:
You can get an idea from http://www.agner.org/optimize/testp.zip and a recent thread http://board.flatassembler.net/topic.php?t=6563 (but much better first link Very Happy)


I would not recommend downloading that, especially because of PMCDOS.ZIP (ugh, that Agner guy just can't take a hint).
Post 28 Jan 2007, 04:23
View user's profile Send private message Visit poster's website Reply with quote
mattst88



Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88 28 Jan 2007, 19:14
rugxulo wrote:
LocoDelAssembly wrote:
You can get an idea from http://www.agner.org/optimize/testp.zip and a recent thread http://board.flatassembler.net/topic.php?t=6563 (but much better first link Very Happy)


I would not recommend downloading that, especially because of PMCDOS.ZIP (ugh, that Agner guy just can't take a hint).


Elaborate please.

_________________
My x86 Instruction Reference -- includes SSE, SSE2, SSE3, SSSE3, SSE4 instructions.
Assembly Programmer's Journal
Post 28 Jan 2007, 19:14
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 28 Jan 2007, 21:06
mattst88, let's just say that neither Richard Stallman nor Linus Torvalds would EVER approve, m'kay? DON'T DO IT!
Post 28 Jan 2007, 21:06
View user's profile Send private message Visit poster's website Reply with quote
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
FrozenKnight 31 Jan 2007, 21:32
to test clocks i use the following code
Code:
  sub   esp, 8
  RDTSC
  mov   [esp], eax
  mov   [esp+4], edx     
then when my code segment is done i use
Code:
  RDTSC
  sub   eax, [esp]
  sbb   edx, [esp+4]
  add   esp, 8    
a few clock cycles are added because of the code involved but generally this will tell you about how much time it took to run a code segment. be warned you should limit the code to one cpu for proper testing. and try to keep the stack properly balanced within the tested code it adjust my code for you stack.
if everything works then the result will be stored in edx:eax.
Post 31 Jan 2007, 21:32
View user's profile Send private message Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8263
Location: Kraków, Poland
Tomasz Grysztar 31 Jan 2007, 21:53
You may also find this interesting.
Post 31 Jan 2007, 21:53
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.