flat assembler
Message board for the users of flat assembler.

Index > Main > FAST COSINE DOUBLE PRECISION FP

r22
Joined: 27 Dec 2004
Posts: 805
Code:
;IN: ptr to qword double precision FP value in radians
;OUT: value of in ptr replaced with Cosine
align 16  ;Lou Ricci, SSE Cosine, for the Community
CosSSE8_2:  ;;parallelism
     mov eax,[esp+4] ;note: unrolling makes SSE slower on P4
     MOVSD xmm0,qword[eax]  ;\
     MOVHPD xmm0,qword[eax] ;SSE3 MOV-DUP
     MOVDQA xmm1,xmm0 ; HI  LO
     MULPD xmm0,xmm1 ; x^2 | x^2
     MULPD xmm1,xmm1 ; x^2 | x^2
     MULSD xmm0,xmm1 ; x^2 | x^4
     MULPD xmm1,xmm1 ; x^4 | x^4
     MOVDQA xmm3, dqword[CosOneP] ; 1.0 | 0.0
     MOVDQA xmm2,xmm0 ;copy for scaling x^'s with xmm1
     mov ecx,-16*6 ;12 iterations (6*2)
  .LP:
     MULPD xmm0, dqword[CosTableP+16*6+ecx] ;x^? / (?)!
     ADDPD xmm3,xmm0    ;1 - + - + -...
     MULPD xmm2,xmm1    ;x^? | x^? TO x^(?+4) | x^(?+4)
     MOVDQA xmm0,xmm2
     add ecx,16
     js .LP
     ;add high qword and low qword of xmm3 for result
     ;err no SSE3 horizontal add yet so, cheap work around
     ;SSE3 HADDPD
     MOVHPD qword[eax],xmm3
     ADDSD xmm3,qword[eax]
     MOVSD qword[eax],xmm3
     ret 4
align 16
CosOneP dq 1.0, 0.0
align 16 ;Table of 1/4! | 1/2!, 1/8! | 1/6! ...
CosTableP dq 0.041666666666666666666666666666667, -0.5
         dq 2.4801587301587301587301587301587e-5, -0.0013888888888888888888888888888889
         dq 2.0876756987868098979210090321201e-9, -2.7557319223985890652557319223986e-7
         dq 4.7794773323873852974382074911175e-14, -1.1470745597729724713851697978682e-11
         dq 4.1103176233121648584779906184361e-19, -1.5619206968586226462216364350057e-16
         dq 1.6117375710961183490487133048012e-24, -8.8967913924505732867488974425025e-22
         dq 3.2798892370698379101520417273121e-30, -2.479596263224797460074943545848e-27
    


The above code is faster than the FPU
You can scale precision (and speed of execution) by
shortening the loop iterations (i.e. change 16*6 to 16*5).
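For readers without an SSE machine handy, here is a minimal Python sketch (my illustration, not part of the posted routine) of the series the loop evaluates: cosine as 1 plus alternating-sign x^(2k)/(2k)! terms, with the coefficients tabulated up front just like CosTableP.

```python
import math

# Alternating-sign 1/(2k)! coefficients, k = 1, 2, 3, ...
# (-1/2!, +1/4!, -1/6!, ...), the same values CosTableP stores in pairs
COEFFS = [(-1.0) ** k / math.factorial(2 * k) for k in range(1, 15)]

def cos_series(x, terms=12):
    """cos(x) ~= 1 + sum of COEFFS[k-1] * x^(2k), as in CosSSE8_2's loop."""
    acc = 1.0
    x2 = x * x          # squared once up front, like the MULPD prologue
    xp = x2             # running power x^(2k)
    for c in COEFFS[:terms]:
        acc += c * xp
        xp *= x2        # advance the power by x^2 per term
    return acc
```

With 12 terms the sum agrees with a library cosine to machine precision for arguments of a few radians; dropping terms trades accuracy for speed, which is the 16*6 vs. 16*5 knob.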
Post 23 Jul 2005, 20:31
tom tobias
Joined: 09 Sep 2003
Posts: 1320
Location: usa
Thank you r22 for this interesting example of using SSE cosine rather than Floating Point cosine. You indicate that this method is "...faster than the FPU", which raises a couple of questions:
1. How much faster?
2. Is it also faster than a slightly LESS precise cosine, for example a chart of values such as you have included, i.e. calculated ahead of time, and then simply (and elegantly) using a look-up table to "compute" the value of the unknown? Yes, this is a bit muddled, let's try again. What happens if, instead of using the SSE instructions to compute the "DOUBLE PRECISION" cosine of a number, you simply built a large chart containing all the doubly precise values, and used conventional 32-bit integers for input values to serve as indices into the (large) look-up table? Wouldn't that be just as fast?
(I think it would be faster!--skip the SSE AND fpu altogether.)
3. Why do you need a "double precision" cosine? Surely you are not writing code using this cpu to travel to outer space, where a couple of thousandths of arc seconds of error could result in catastrophe! For ordinary engineering applications, I am struggling to think of a desktop cpu application that requires such precision.... Smile
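As I read tom's proposal, it amounts to something like the following sketch (my illustration only; the table size N and the integer-index angle convention are my assumptions, not anything from the thread):

```python
import math

N = 4096  # arbitrary granularity: the angle is quantized to 2*pi/N steps
TAB = [math.cos(k * 2.0 * math.pi / N) for k in range(N)]

def cos_chart(idx):
    """Cosine for integer angle index idx, standing for idx*2*pi/N radians.
    One masked table load, no arithmetic at all."""
    return TAB[idx & (N - 1)]
```

The catch, raised in the replies below, is that 4096 doubles are 32 kB of table, which competes with code and data for L1 cache.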
Post 23 Jul 2005, 21:19
revolution
When all else fails, read the source
Joined: 24 Aug 2004
Posts: 16901
Location: In your JS exploiting you and your system
tom tobias: When I was doing ray tracing some years back I found that single-precision SIN and COS would create small flaws in the output; I had to change to double precision to get nice results.

Look-up tables can be useful, but sometimes they give problems with cache evictions etc. It all depends on your application.

Thanks for the nice code r22.
Post 24 Jul 2005, 00:59
r22
1. Using 6 iterations (the exact code posted above) it's 1.5 seconds faster over the course of 0xFFFFFF function calls.

2. No, using fewer iterations (say 3 instead of 6) will give a less exact result faster than a huge lookup table. The most double-precision FP values you can fit in one page (4 kB) is 512. Any more and you'll encounter (as Rev. said) caching stalls.

3. It's meant to be a replacement for the FPU. The function can easily be ported to native 64-bit code, and if you are not aware, 64-bit apps use SIMD instead of the FPU for floating-point calculations.

I personally have no use for it, I thought it was interesting to code so I did.
Post 24 Jul 2005, 01:56
revolution
If I make the changes as below it runs 5% faster on my Pentium M
Code:
     pshufd xmm1,xmm3,0eeh   ;     MOVHPD qword[eax],xmm3
     addsd xmm3,xmm1         ;     ADDSD xmm3,qword[eax]
    
Post 24 Jul 2005, 05:28
tom tobias
r22 wrote:
... The most double precision FP values you can fit in one page (4kb) is 512. Any more and you'll encounter (as Rev. said) caching stalls.
...

Perhaps I simply misunderstand.
I was under the impression, perhaps incorrectly, that commencing with the Pentium, the "new" register CR4 contained a bit, the PSE bit, which permits page sizes of four megabytes if set, and of the old standard you reference, i.e. four kilobytes, if clear.
Cache delays are another problem altogether, so far as I am aware. Perhaps you can educate me, by providing a reference to the relationship between large page sizes and cache delays? Thanks, tom Smile
P.S. Thanks for indicating that the time difference was 1.5 seconds for execution of the same task using the FPU rather than SSE instructions. Umm, how many seconds for either task, so that we can learn the percentage loss associated with calculating double-precision transcendentals via the FPU?
Post 24 Jul 2005, 17:04
revolution
tom tobias: I know you did not quote me above but I want to clarify what I said previously about the cache. Depending on the particular application you are running using a table might make things slower because of cache eviction and thrashing. Now, let's say our core module doing most of the intense calculation is 20kB and we use a look-up table for COS/SIN using another 16kB. My Pentium M laptop has 32kB L1 cache so in this situation something has to be pushed out of L1 cache when we run the code. But if instead we use a loop to calculate COS/SIN that is only a few hundred bytes we can fit more of other code into the L1 cache. But, as I said, it all depends on your application. In a different situation a table might be just perfect and give a great speed-up.

As for the page size of 4kB or 4MB, I doubt that has much effect on the overall runtime speed. The cache and page mapping mechanisms are separate parts within the processor.

Also: when I did a quick test with r22's code above and the standard FCOS I found the timing on my Pent-M like this:

FCOS: 1.0
r22: 0.76
r22 with my mod above: 0.72

The times shown are normalised. I did 256meg iterations for each and I expect all code and data was in the L1 cache during the test.
Post 25 Jul 2005, 01:46
MCD
Joined: 21 Aug 2004
Posts: 604
Location: Germany
Quote:
The most double precision FP values you can fit in one page (4kb) is 512
Yes, but the L1 cache is usually larger than a 4 kB page. And as revolution mentioned, the paging and caching parts of the CPU are rather independent, so larger look-up tables are possible, about 8 kB-32 kB, which would give 1024-4096 double-precision FP values. Well, for some purposes this isn't precise enough, so the best speed/quality approach for me seems to be combining a medium-size look-up table with only a few iterations to refine the result. This is easy to do for Newton-style iterations, because they converge very fast (quadratically) to the result and discard errors, but I'm still trying to figure out how this "refinement" would work with Taylor-based iterations. That requires some math work.

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 26 Jul 2005, 13:34
MCD
Can't this
r22 wrote:
Code:
     MOVSD xmm0,qword[eax]  ;\
     MOVHPD xmm0,qword[eax] ;SSE3 MOV-DUP
    

be simplified to
Code:
     MOVSD xmm0,qword[eax]
     MOVLHPS xmm0,xmm0
    
? (Don't worry about the PS; it's a bit-for-bit copy either way.)

And another thing: every alternating Taylor series a_n can be refined by simply averaging a_(n-1) and a_n, which always amounts to dividing the very last coefficient by 2. I was about to write this formula with some Sigma symbols, but was too lazy.

So, if you really need to take all the iterations you mentioned, then the last row in your cosine coefficient table would be:
Code:
dq 1.6399446185349189550760208636561e-30, -2.479596263224797460074943545848e-27    


The only disadvantage is that this way it's harder to continue the iteration if you ever want to do so.
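The halved-last-coefficient claim is easy to check numerically; in this Python sketch (my illustration, not thread code) `averaged` is the truncated cosine series with its final coefficient divided by 2, and it equals the average of the last two partial sums:

```python
import math

def partial(x, n):
    """Partial sum 1 + sum_{k=1..n} (-1)^k x^(2k)/(2k)! of the cosine series."""
    s, t = 1.0, 1.0
    for k in range(1, n + 1):
        t *= -x * x / ((2 * k - 1) * (2 * k))  # next signed term
        s += t
    return s

def averaged(x, n):
    """Same truncated series, but with the very last coefficient halved."""
    s, t = 1.0, 1.0
    for k in range(1, n + 1):
        t *= -x * x / ((2 * k - 1) * (2 * k))
        s += t / 2 if k == n else t
    return s
```

Since an alternating series with shrinking terms brackets its limit between consecutive partial sums, the averaged value is guaranteed to lie within half of the final term of the true cosine.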
Post 27 Jul 2005, 08:03
revolution
Quote:
be simplyfied to

Code:

MOVSD xmm0,qword[eax]
MOVLHPS xmm0,xmm0


I tried:
Code:
     MOVSD xmm0,qword[eax] 
     PSHUFD xmm0,xmm0,044h    

But it was slower. Don't know why.
Post 27 Jul 2005, 08:52
MCD
revolution wrote:
Quote:
be simplyfied to

Code:

MOVSD xmm0,qword[eax]
MOVLHPS xmm0,xmm0


I tried:
Code:
     MOVSD xmm0,qword[eax] 
     PSHUFD xmm0,xmm0,044h    

But it was slower. Don't know why.

That's hard to guess. On which CPU have you tested it?
Maybe it's because it has an immediate. Or, more likely, PSHUFD is 1 byte longer, because it requires a 66h prefix byte, whereas MOVLHPS doesn't.

Back to the original code. Can't r22 just replace
Code:
     MOVHPD qword[eax],xmm3 
     ADDSD xmm3,qword[eax] 
     MOVSD qword[eax],xmm3
    

with
Code:
     MOVHLPS xmm0/1/2,xmm3 
     ADDSD xmm3,xmm0/1/2;means any of xmm0 xmm1 or xmm2 
     MOVSD qword[eax],xmm3
    
? Was that so hard to guess? Anyway, the qword [eax] would almost certainly be in the L1 cache.

Anyway, here is the complete cosine routine as I have rewritten it:
Code:
align   16
CosSSE8_2:
        mov     eax,[esp+4]     ;Load eax before using it
        mov     ecx,-16*6       ;Load ecx before using it
        movsd   xmm0,qword[eax]
        movlhps xmm0,xmm0
        mulpd   xmm0,xmm0       ;1 FP mul eliminated Smile And try to put..
        movdqa  xmm1,xmm0
        mulsd   xmm0,xmm0       ;..some space between...
        movdqa  xmm2,xmm0
        mulpd   xmm1,xmm1       ;...multiple multiplications
        movdqa  xmm3,dqword [CosOne]
    .CosLoop:
        mulpd   xmm0,dqword [CosTab+ecx+16*6]
        addpd   xmm3,xmm0
        mulpd   xmm2,xmm1
        movdqa  xmm0,xmm2
        add     ecx,16
        jnz     .CosLoop        ;Maybe you prefer js or jnc
;SSE3 HADDPD
        movhlps xmm0,xmm3
        addsd   xmm3,xmm0
        movsd   qword [eax],xmm3
        ret     4
align 16
CosOne  dq 1.0, 0.0
CosTab  dq 4.166666666666667e-2 , -5.000000000000000e-1;I guess 16 digits are enough for 64bit fp
        dq 2.480158730158730e-5 , -1.388888888888888e-3
        dq 2.087675698786810e-9 , -2.755731922398589e-7
        dq 4.779477332387385e-14, -1.147074559772972e-11
        dq 4.110317623312165e-19, -1.561920696858623e-16
        dq 1.611737571096118e-24, -8.896791392450573e-22
        dq 1.639944618534919e-30, -2.479596263224797e-27
    

Post 27 Jul 2005, 13:35
r22
Shuffling the MOVDQA's squeaks out a few more ticks, ~5% faster,
by putting one between the MOVLHPS and the MULPD.

Also, I'm using 5 iterations now; 6 doesn't seem to improve the accuracy of the results (this could be the Visual Studio 6 printf API I'm using, not sure).

Code:
align 16 ;fASM Community SSE cosine
CosSSE8_2:  ;;parallelism
     mov eax,[esp+4]
     mov ecx,-16*5 ;10 iterations (5*2)
     MOVSD xmm0,qword[eax]  ;\
     MOVLHPS xmm0,xmm0;qword[eax] ;SSE3 MOV-DUP
     MOVDQA xmm3, dqword[CosOneP] ; 1.0 | 0.0
     MULPD xmm0,xmm0 ;x^2 x^2
     MOVDQA xmm1,xmm0 ; HI  LO
     MULSD xmm0,xmm0 ; x^2 | x^4
     MOVDQA xmm2,xmm0 ;copy for scaling x^'s with xmm1
     MULPD xmm1,xmm1 ; x^4 | x^4
  .LP:
     MULPD xmm0, dqword[CosTableP+16*5+ecx] ;x^? / (?)!
     ADDPD xmm3,xmm0    ;1 - + - + -...
     MULPD xmm2,xmm1    ;x^? | x^? TO x^(?+4) | x^(?+4)
     MOVDQA xmm0,xmm2
     add ecx,16
     js .LP
     ;add high qword and low qword of xmm3 for result
     ;err no SSE3 horizontal add yet so, cheap work around
     ;SSE3 HADDPD
     MOVHLPS xmm1,xmm3;,0eeh ;didn't think of it
     addsd xmm3,xmm1       ;suggested by: revolution
     MOVSD qword[eax],xmm3
     ret 4
align 16
CosOneP dq 1.0, 0.0
align 16 ;Table of 1/4! | 1/2!, 1/8! | 1/6! ...
CosTableP dq 0.041666666666666666666666666666667, -0.5
         dq 2.4801587301587301587301587301587e-5, -0.0013888888888888888888888888888889
         dq 2.0876756987868098979210090321201e-9, -2.7557319223985890652557319223986e-7
         dq 4.7794773323873852974382074911175e-14, -1.1470745597729724713851697978682e-11
         dq 4.1103176233121648584779906184361e-19, -1.5619206968586226462216364350057e-16
         dq 1.6117375710961183490487133048012e-24, -8.8967913924505732867488974425025e-22
         dq 3.2798892370698379101520417273121e-30, -2.479596263224797460074943545848e-27
    


Precision breakdown
Quote:

Test Values
---------------------------------------------------------------
Pi/2: 1.5707963267948966192313216916398
Pi/6: 0.52359877559829887307710723054658

Precision Of Results
Cos(Pi/2): Cos(Pi/6)
--------------------------------------------------------------
WinXP Calc.exe
0: 0.86602540378443864676372317075294
CosFPU (using printf api with a format string of %1.30f)
0.000000000000000061230317691119: 0.866025403784438600000000000000
CosSSE8_2 (using printf api with a format string of %1.30f)
0.000000000000000222044604925031: 0.866025403784438600000000000000


I think it's safe to say we've optimized this function to death Razz
Can't wait to use the SSE3 opcodes and make a single-precision one for game engines (put it up against a LUT).

The setup cost might be a little high to get the xmm0 register filled with x^2 | x^4 | x^6 | x^8 to make the single-precision version twice as parallel, but it would only have to loop twice. Can't wait to buy a 4400+ in Sept.
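The 4-lane single-precision idea can be prototyped in scalar Python before committing to SSE (my sketch; the lane layout x^2 | x^4 | x^6 | x^8 follows r22's description, everything else is my assumption). Each pass consumes four series terms and advances every lane by x^8, so two passes cover eight terms:

```python
import math

def cos_4lane(x, passes=2):
    """Cosine series evaluated four terms per pass, mimicking a packed
    register preloaded with x^2 | x^4 | x^6 | x^8."""
    x2 = x * x
    lanes = [x2, x2 * x2, x2 * x2 * x2, x2 * x2 * x2 * x2]  # x^2 .. x^8
    step = lanes[3]                 # every lane advances by x^8 per pass
    acc, k = 1.0, 1
    for _ in range(passes):
        for p in lanes:             # terms (-1)^k x^(2k)/(2k)!, k = 1..4, 5..8
            acc += (-1.0) ** k * p / math.factorial(2 * k)
            k += 1
        lanes = [p * step for p in lanes]
    return acc
```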
Post 27 Jul 2005, 17:08
MCD
Just some more generalizations: you can use the same code with different table values to calculate other Taylor-based functions, like sine, arctangent, the natural exponential and logarithm, and so on.

Another thing I thought of was optimizing the cosine algorithm by passing smaller values to it, as this equation allows:

Code:
SQR(COS(a*0.5))*2 - 1 = COS(a)
    


Naturally you can nest it, but I doubt it's worth the effort in terms of precision/speed.
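A quick numeric check of the half-angle route (my sketch; `math.cos` stands in for the series evaluation of the halved argument):

```python
import math

def cos_via_half(a):
    """cos(a) = 2*cos(a/2)^2 - 1, i.e. SQR(COS(a*0.5))*2 - 1 from above."""
    c = math.cos(a * 0.5)   # the series converges faster for the halved angle
    return 2.0 * c * c - 1.0
```

Each nesting level halves the argument at the cost of one extra multiply-and-subtract on the way back out.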


The second idea I had was to derive the Newton iteration for cosine (get one look-up-table value and refine it with only a few quadratically converging iterations), but that one is too calculation-intensive and thus completely useless:
Code:
x_n+1 = x_n + (arccos(x_n) - a)*SqRt(1 - x_n*x_n)
    

Needless to say, an arccos and a square root inside a Newton iteration are not THAT fast Smile

Another approach is to combine a look-up table and use the Taylor-based function code only to calculate the rest, as the following formula may allow:

Code:
COS(a) - COS(b) = -2 * SIN((a+b)*0.5) * SIN((a-b)*0.5)

let b=Trunc(a)

<=> COS(a) = COS(Trunc(a)) - SIN((a+Trunc(a))*0.5) * SIN((a-Trunc(a))*0.5) * 2
<=> COS(a) = COSTab[Trunc(a)] - SIN((a+Trunc(a))*0.5) * SIN((a-Trunc(a))*0.5) * 2  ;1.)
    

In 1.), COSTab[Trunc(a)] is a value given by a look-up table (very fast to fetch) and SIN((a-Trunc(a))*0.5) should also be very fast to compute, since its argument is very close to 0. The only problem is SIN((a+Trunc(a))*0.5), whose argument would somehow have to be rearranged to also be very close to 0. Not an easy task. Anyone have an idea?
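The rearrangement in 1.) can be sanity-checked directly (my sketch; `math` calls stand in for the table fetch and the series):

```python
import math

def cos_via_diff(a):
    """cos(a) = cos(b) - 2*sin((a+b)/2)*sin((a-b)/2), with b = Trunc(a)."""
    b = float(math.trunc(a))
    return math.cos(b) - 2.0 * math.sin((a + b) * 0.5) * math.sin((a - b) * 0.5)
```

As noted, SIN((a-b)*0.5) has a small argument, but (a+b)*0.5 is as large as a itself, which is why this form is not yet practical.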

Hope that wasn't too much math at once

Post 28 Jul 2005, 14:08
r22
Ultrano used an FPU method for the sin function with a LUT http://board.win32asmcommunity.net/index.php?topic=21430.0
Post 28 Jul 2005, 16:58
MCD
Okay, I finally got a very fast and effective method to calculate (double-precision) cosine by combining a short Taylor series with a LUT containing 256 sine and cosine values covering one cycle (2pi).

The principle is as follows:

Code:
cos(a) = cos( a-round(a)   +   round(a) )
 
using this trigonometric equation cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)

=> cos(a-round(a)) * costab[round(a)] - sin(a-round(a)) * sintab[round(a)]
    

The idea behind all that is that the series for both cos(a-round(a)) and sin(a-round(a)) converge much faster than a direct cos(a) calculation, since the argument is small. Furthermore, costab[round(a)] and sintab[round(a)] are simple table fetches and thus very fast.
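Here is a plain-Python model of that scheme (my reconstruction for checking the math, not MCD's exact arithmetic): round the scaled angle to pick one of 256 table entries, then repair the small residual with very short cosine and sine series:

```python
import math

N = 256
SCALE = N / (2.0 * math.pi)              # Rad2LUT: 256/(2*pi)
STEP = 2.0 * math.pi / N                 # LUT2Rad: 2*pi/256
COS_TAB = [math.cos(k * STEP) for k in range(N)]
SIN_TAB = [math.sin(k * STEP) for k in range(N)]

def cos_lut(a):
    """cos(a) = cos(r)*costab[k] - sin(r)*sintab[k], k = round(a*SCALE)."""
    t = a * SCALE
    k = round(t)                         # cvtsd2si: round to nearest
    r = (t - k) * STEP                   # residual, |r| <= STEP/2 ~ 0.0123
    r2 = r * r
    c = 1.0 - r2 / 2 + r2 * r2 / 24 - r2 * r2 * r2 / 720   # cos(r)
    s = r * (1.0 - r2 / 6 + r2 * r2 / 120)                 # sin(r)
    return c * COS_TAB[k & (N - 1)] - s * SIN_TAB[k & (N - 1)]
```

With |r| below 0.0123, truncating cos after the r^6 term and sin after the r^5 term already leaves truncation error far below double-precision rounding.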

And here goes the code:
Code:
CosDblSSE:
        movlps  xmm0,[esi];source value (angle in radians)
        mulsd   xmm0,[Rad2LUT]
        cvtsd2si        eax,xmm0
        cvtsi2sd        xmm1,eax
        and     eax,0FFh
        shl     eax,4
        subsd   xmm0,xmm1
        mulsd   xmm0,[LUT2Rad]
        movlhps xmm0,xmm0
        mulsd   xmm0,xmm0
        movaps  xmm1,xmm0
        movaps  xmm2,xmm0
        movlhps xmm2,xmm2
        mulpd   xmm1,xmm2
        mulsd   xmm0,[_0.5]
        addsd   xmm0,[_1.0]
        mulpd   xmm2,xmm1
        mulpd   xmm1,[CosSinCoeff]
        addpd   xmm0,xmm1
        mulpd   xmm2,[CosSinCoeff+16]
        addpd   xmm0,xmm2
        mulpd   xmm0,[CosSinLUT+eax]
        movhlps xmm1,xmm0
        subsd   xmm0,xmm1
        movlps  [edi],xmm0;destination value
        ret

align   16
Rad2LUT dq      4.074366543152521e1 ;256/(2*pi)
LUT2Rad dq      2.454369260617026e-2;2*pi/256
_0.5    dq      0.5
_1.0    dq      1.0
CosSinCoeff:    ;COS, -SIN
        dq       4.166666666666667e-2, -1.666666666666667e-1; 1/4!, -1/3!
        dq      -1.388888888888889e-3,  8.333333333333333e-3;-1/6!,  1/5!
CosSinLUT:
        rb      256*8*2;Have to be setup before first use
;The format of the LUT is as follows:
;[Double cosine(0*(2pi/256))] [Double sine(0*(2pi/256))]
; 0                            8
;[Double cosine(1*(2pi/256))] [Double sine(1*(2pi/256))]
; 10h                          18h
;               . . .                   . . .
    

The code for initializing the LUT is not included, but this can be done with a simple FSINCOS. Unfortunately I don't have a CPU that supports SSE2/SSE3 instructions, so I was unable to test this procedure. Hopefully I got this SSE1 single-precision equivalent cosine procedure right:
Code:
CosSnglSSE:
        movss   xmm0,[esi]
        mulss   xmm0,[Rad2LUT]
        cvtss2si        eax,xmm0
        cvtsi2ss        xmm1,eax
        and     eax,0FFh
        shl     eax,4
        subss   xmm0,xmm1
        mulss   xmm0,[LUT2Rad]
        movaps  xmm1,xmm0
        mulss   xmm1,xmm1
        movlhps xmm1,xmm1
        mulss   xmm1,xmm0
        movhps  xmm0,qword [_1.0]
        mulps   xmm1,[SinCosCoeff]
        addps   xmm0,xmm1
        mulps   xmm0,[SinCosLUT+eax] ; 1.)
        movhlps xmm1,xmm0
        subss   xmm1,xmm0 ;1.)
        movss   [edi],xmm1
        ret

align   16
Rad2LUT dd      40.743665  ;256/(2*pi)
LUT2Rad dd      0.024543692;2*pi/256
_1.0    dd      1.0,0
SinCosCoeff:
        dd      -0.16666667,0, -0.5,0;-1/3!, -1/2!
SinCosLUT:
        rb      256*4*4;Have to be setup before first use
;The format of the LUT is as follows:
;[Single sine(0*(2pi/256))] [0] [Single cosine(0*(2pi/256))] [0]
; 0                          4   8                            0Ch
;[Single sine(1*(2pi/256))] [0] [Single cosine(1*(2pi/256))] [0]
; 10h                        14h 18h                          1Ch
;               . . .                   . . .
    

Tests showed that this works like it should, except for angles close to pi+2*pi*n, where the result loses several digits of precision. That's neither because the Taylor cosine/sine calculation has too few steps nor because the LUT is too small; the error happens at the very end of the calculation, where all 4 intermediate results are combined in the lines marked 1.), following this formula:

cos(a)*cos(b) - sin(a)*sin(b)

Unfortunately, these errors are caused by the nature of floating-point numbers themselves, and since we can't refactor this expression, there is nothing we can do except compute those last steps in double precision if we really want such precise results Sad
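That loss can be reproduced in pure Python by forcing every intermediate through single precision with `struct` (my generic illustration; the angles are arbitrary, not MCD's test values). When cos(a+b) is small, the two ~0.5-sized products nearly cancel: the absolute error stays at their single-precision rounding level, but relative to the tiny result most significant digits are gone.

```python
import math
import struct

def f32(x):
    """Round a double to the nearest IEEE-754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

x, y = 0.7, 0.9                  # x + y = 1.6, cos(1.6) ~ -0.0292
c = f32(f32(math.cos(x)) * f32(math.cos(y)))   # cos(x)*cos(y) in single
s = f32(f32(math.sin(x)) * f32(math.sin(y)))   # sin(x)*sin(y) in single
approx = f32(c - s)              # the final, cancelling subtraction
exact = math.cos(x + y)
abs_err = abs(approx - exact)
```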

Post 01 Aug 2005, 14:11
revolution
I get the following timing results with my test code:

FCOS: 1.000
r22 last suggestion: 0.6341
MCD 256 LUT: 0.6442 (accurate to 5 decimal places)

MCD: The result from your code only has 5 digits of accuracy; is that deliberate or my mistake? I used SSE2 and initialised the LUT with the following code:
Code:
init_CosSinLUT:
        mov     edi,CosSinLUT
        fld     qword[_2div256] ;2/256
        fldpi
        fmulp
        fldz
.a:     fld     st0
        fsincos
        fstp    qword[edi+0]
        fstp    qword[edi+8]
        fadd    st0,st1
        add     edi,16
        cmp     edi,CosSinLUT+16*256
        jb      .a
        fucompp
        ret    
What did I do wrong?
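For reference, a Python model of what that init loop produces (my sketch): FSINCOS leaves the cosine in ST0 on top, so the first FSTP writes cosine at offset 0 and the second writes sine at offset 8, matching MCD's declared layout.

```python
import math

N = 256
STEP = 2.0 * math.pi / N        # fld [_2div256] ; fldpi ; fmulp

# (cosine, sine) pairs, entry k at byte offset 16*k: cos at +0, sin at +8
lut = [(math.cos(k * STEP), math.sin(k * STEP)) for k in range(N)]
```

One difference worth noting: the FPU loop accumulates the angle with repeated FADDs, so its later entries can carry slightly more rounding error than computing k*STEP directly — though nowhere near enough to explain a 5-digit discrepancy.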
Post 01 Aug 2005, 15:48
MCD
Your code seems to be correct, as does mine. My code seems to have precision difficulties with input angles close to pi + n*2*pi, where n is any integer. So please tell me the value you tested the code with.

Well, the issue of speed is also clear to me. My code is simply not that optimized yet; I guess it's still slow because of long chains of dependencies on previous instruction results. Someone should fix that.


edit: I've just tested my code (on another computer with SSE2) with the worst-case angle of pi, and my cos procedure still had 8 correct decimal digits. Values further away from pi even get full double precision.
Post 02 Aug 2005, 11:39
Madis731
Joined: 25 Sep 2003
Posts: 2145
Location: Estonia
A great page on square roots, if you are looking for one:
http://www.azillionmonkeys.com/qed/sqroot.html
There are integer and float code samples, in both C and ASM.
The page has changed since I last checked it, but a quick look tells me the algorithms have remained. Of course a lot of additions have been made.
Quote:

from Paul Hsieh's homepage:
... the fastest ISQRT() algorithm by far is to go through the FPU. Even with the change in FP control word it's quite fast. It looks like the P4 has an optimization for FLDCW that's better than what is in the Athlon. Since most applications use just two different CW values I speculate that they might be "caching" the last two CWs seen internally, i.e. this is like renaming limited to 2 rename registers.
Post 02 Aug 2005, 13:28
revolution
MCD: I was testing with the fixed point of COS, +0.739085133215160642 (the value where COS(x) = x). When I use your code it settles at +0.739088549289849461.
Post 02 Aug 2005, 14:29
MCD
Thanks for the testing. I missed a really stupid sign bug.
The _0.5 constant
Code:
_0.5    dq      0.5    

should be changed to
Code:
_Neg0.5 dq      -0.5    


These sign bugs are pretty annoying Sad
With some code rearrangements (maybe a bit faster/slower on your CPU), this should be it:
Code:
CosDblSSE:
        movlps  xmm0,[esi];source value (angle in radiant)
        mulsd   xmm0,[Rad2LUT]
        cvtsd2si        eax,xmm0
        cvtsi2sd        xmm1,eax
        subsd   xmm0,xmm1
        mulsd   xmm0,[LUT2Rad]
        movlhps xmm0,xmm0
        mulsd   xmm0,xmm0
        movaps  xmm1,xmm0
        movaps  xmm2,xmm0
        movlhps xmm2,xmm2
        mulpd   xmm1,xmm2
        and     eax,0FFh
        mulpd   xmm2,xmm1
        shl     eax,4
        mulpd   xmm1,[CosSinCoeff]
        mulpd   xmm2,[CosSinCoeff+16]
        addsd   xmm0,[_Neg2.0]
        mulsd   xmm0,[_Neg0.5]
        addpd   xmm0,xmm1
        addpd   xmm0,xmm2
        mulpd   xmm0,[CosSinLUT+eax]
        movhlps xmm1,xmm0
        subsd   xmm0,xmm1
        movlps  [edi],xmm0;destination value
        ret

align   16
Rad2LUT dq      4.074366543152521e1 ;256/(2*pi)
LUT2Rad dq      2.454369260617026e-2;2*pi/256
CosSinCoeff:    ;COS, -SIN
        dq       4.166666666666667e-2, -1.666666666666667e-1; 1/4!, -1/3!
        dq      -1.388888888888889e-3,  8.333333333333333e-3;-1/6!,  1/5!
_Neg2.0 dq      -2.0; just an optimization
_Neg0.5 dq      -0.5;fixed
CosSinLUT:
        rb      256*8*2;Have to be setup before first use
;The format of the LUT is as follows:
;[Double cosine(0*(2pi/256))] [Double sine(0*(2pi/256))]
; 0                            8
;[Double cosine(1*(2pi/256))] [Double sine(1*(2pi/256))]
; 10h                          18h
;               . . .                   . . .
    

Now it seems to work flawlessly, except still for values close to pi + 2*n*pi Sad
Maybe precaching [CosSinLUT+eax] with a prefetch instruction would speed this up a little.

One more big speed-up can be achieved with the following if you need to calculate both sine and cosine:
Code:
CosSinDblSSE:
        movlps  xmm0,[esi];source value (angle in radiant)
        mulsd   xmm0,[Rad2LUT]
        cvtsd2si        eax,xmm0
        cvtsi2sd        xmm1,eax
        subsd   xmm0,xmm1
        mulsd   xmm0,[LUT2Rad]
        movlhps xmm0,xmm0
        mulsd   xmm0,xmm0
        movaps  xmm1,xmm0
        movaps  xmm2,xmm0
        movlhps xmm2,xmm2
        mulpd   xmm1,xmm2
        and     eax,0FFh
        mulpd   xmm2,xmm1
        shl     eax,4
        mulpd   xmm1,[CosSinCoeff]
        mulpd   xmm2,[CosSinCoeff+16]
        addsd   xmm0,[_Neg2.0]
        mulsd   xmm0,[_Neg0.5]
        addpd   xmm0,xmm1
        addpd   xmm0,xmm2
        movlhps xmm1,xmm0
        movhlps xmm1,xmm0
        mulpd   xmm0,[CosSinLUT+eax]
        mulpd   xmm1,[CosSinLUT+eax]
        movhlps xmm2,xmm0
        movhlps xmm3,xmm1
        subsd   xmm0,xmm2
        addsd   xmm1,xmm3
        movlhps xmm0,xmm1
        movaps  [edi],xmm0;destination value: cos(angle), sin(angle)
        ret
    

Post 04 Aug 2005, 11:55


Copyright © 1999-2019, Tomasz Grysztar.

Powered by rwasa.