flat assembler
Message board for the users of flat assembler.
> Main > FAST COSINE DOUBLE PRECISION FP
tom tobias 23 Jul 2005, 21:19
Thank you r22 for this interesting example of using SSE cosine rather than floating point cosine. You indicate that this method is "...faster than the FPU", which raises a couple of questions:
1. How much faster?
2. Is it also faster than a slightly LESS precise cosine, for example a chart of values such as you have included, i.e. calculated ahead of time, then simply (and elegantly) using a lookup table to "compute" the value of the unknown? Yes, this is a bit muddled, let's try again. What happens if, instead of using the SSE instructions to compute the "DOUBLE PRECISION" cosine of a number, you simply built a large chart containing all the doubly precise values, and used conventional 32-bit integers for input values to serve as indices into the (large) lookup table? Wouldn't that be just as fast? (I think it would be faster! Skip the SSE AND FPU altogether.)
3. Why do you need a "double precision" cosine? Surely you are not writing code using this CPU to travel to outer space, where a couple of thousandths of arc seconds of error could result in catastrophe! For ordinary engineering applications, I am struggling to think of a desktop CPU application that requires such precision...
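The integer-indexed table in question 2 can be sketched in a few lines. This is a minimal model, not anything posted in the thread; the table size N and the nearest-entry rounding are assumptions, and the accuracy is limited by the table step of about 2*pi/N:

```python
import math

# Hypothetical sketch of the lookup-table cosine tom tobias describes:
# precompute cosines at fixed angular steps, index with a plain integer.
N = 4096  # assumed table size; trades memory for accuracy
TABLE = [math.cos(2.0 * math.pi * i / N) for i in range(N)]

def cos_lut(angle):
    # Nearest-entry lookup; worst-case error is roughly pi/N.
    idx = round(angle * N / (2.0 * math.pi)) % N  # integer index, no FPU/SSE math
    return TABLE[idx]
```

With N = 4096 the result is only good to about 3-4 decimal digits; a table accurate to full double precision would need impractically many entries, which is exactly where the cache discussion below comes in.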
revolution 24 Jul 2005, 00:59
tom tobias: When I was doing ray tracing some years back I found that single precision SIN and COS would create small flaws in the output; I had to change to double precision to get nice results.
Lookup tables can be useful, but sometimes they give problems with cache evictions etc. It all depends on your application. Thanks for the nice code r22.
r22 24 Jul 2005, 01:56
1. Using 6 iterations (the exact code posted above) it's 1.5 seconds faster over the course of 0xFFFFFF function calls.
2. No, using fewer iterations (say 3 instead of 6) will give a less exact result faster than a huge lookup table. The most double precision FP values you can fit in one page (4kB) is 512. Any more and you'll encounter (as Rev. said) caching stalls.
3. It's meant to be a replacement for the FPU. The function can easily be ported to native 64-bit code, and in case you are not aware, 64-bit apps use SIMD instead of the FPU for floating point calculations. I personally have no use for it; I thought it was interesting to code, so I did.
revolution 24 Jul 2005, 05:28
If I make the changes as below it runs 5% faster on my Pentium M
Code:
    pshufd xmm1,xmm3,0eeh ; was: MOVHPD qword[eax],xmm3
    addsd  xmm3,xmm1      ; was: ADDSD xmm3,qword[eax]
tom tobias 24 Jul 2005, 17:04
r22 wrote: ... The most double precision FP values you can fit in one page (4kb) is 512. Any more and you'll encounter (as Rev. said) caching stalls.
Perhaps I simply misunderstand. I was under the impression, perhaps incorrectly, that commencing with the Pentium, the "new" register CR4 contained a bit, the PSE bit, which if set permitted page sizes of four megabytes, and if clear, the old standard which you reference, i.e. four kilobytes. Cache delays are another problem altogether, so far as I am aware. Perhaps you can educate me, by providing a reference to the relationship between large page sizes and cache delays? Thanks, tom
P.S. Thanks for indicating that the time increase was 1.5 seconds for execution of the same task using FP rather than SSE instructions. Umm, how many seconds then, for either task, so that we can learn about the % loss associated with calculating double precision transcendentals via FP?
revolution 25 Jul 2005, 01:46
tom tobias: I know you did not quote me above but I want to clarify what I said previously about the cache. Depending on the particular application you are running using a table might make things slower because of cache eviction and thrashing. Now, let's say our core module doing most of the intense calculation is 20kB and we use a look-up table for COS/SIN using another 16kB. My Pentium M laptop has 32kB L1 cache so in this situation something has to be pushed out of L1 cache when we run the code. But if instead we use a loop to calculate COS/SIN that is only a few hundred bytes we can fit more of other code into the L1 cache. But, as I said, it all depends on your application. In a different situation a table might be just perfect and give a great speed-up.
As for the page size of 4kB or 4MB, I doubt that has much effect on the overall runtime speed. The cache and page mapping mechanisms are separate parts within the processor. Also, when I did a quick test with r22's code above and the standard FCOS, I found the timing on my Pent-M like this:
FCOS: 1.0
r22: 0.76
r22 with my mod above: 0.72
The times shown are normalised. I did 256meg iterations for each and I expect all code and data was in the L1 cache during the test.
MCD 26 Jul 2005, 13:34
Quote: The most double precision FP values you can fit in one page (4kb) is 512
_________________
MCD - the inevitable return of the Mad Computer Doggy
-||__/
.|+-~
.||
||
MCD 27 Jul 2005, 08:03
Can't this
r22 wrote:
be simplified to
Code:
    MOVSD   xmm0,qword[eax]
    MOVLHPS xmm0,xmm0
And another thing: every truncated alternating Taylor series can be refined by simply averaging the partial sums a_(n-1) and a_n, which always amounts to dividing the very last coefficient by 2. I was about to write this formula with some Sigma symbol, but was too lazy. So, if you really need to take all the iterations you mentioned, then the last pair in your cosine coefficient table would be:
Code:
    dq 1.6399446185349189550760208636561e-30, -2.479596263224797460074943545848e-27
The only disadvantage is that this way it's harder to continue the iteration if you ever want to do so one day.
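MCD's averaging trick can be checked numerically. The point is purely algebraic: averaging the last two partial sums of the alternating series equals keeping the shorter sum and adding only half of the final term, which is why it reduces to halving the last table coefficient. A small Python check (not thread code):

```python
import math

# Partial sums of the alternating cosine series, two per "iteration" pair.
def cos_partial_sums(x, n):
    s, term, sums = 1.0, 1.0, [1.0]
    for k in range(1, n + 1):
        term *= -x * x / ((2 * k - 1) * (2 * k))
        s += term
        sums.append(s)
    return sums

x = 0.7
s = cos_partial_sums(x, 6)
averaged = 0.5 * (s[-2] + s[-1])                 # average of the last two partial sums
half_last_term = s[-2] + 0.5 * (s[-1] - s[-2])   # same thing: add half the last term
```

The two expressions agree to rounding error, confirming that the refinement is just a change of the final coefficient.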
revolution 27 Jul 2005, 08:52
Quote: be simplified to
I tried:
Code:
    MOVSD  xmm0,qword[eax]
    PSHUFD xmm0,xmm0,044h
But it was slower. Don't know why.
MCD 27 Jul 2005, 13:35
revolution wrote:
That's hard to guess. On which CPU have you tested it? Maybe it is because it has an immediate. Or, what is more likely, PSHUFD is 1 byte longer, because it requires a 66h prefix byte, whereas MOVLHPS doesn't.
Back to the original code. Can't r22 just replace
Code:
    MOVHPD qword[eax],xmm3
    ADDSD  xmm3,qword[eax]
    MOVSD  qword[eax],xmm3
with
Code:
    MOVHLPS xmm0/1/2,xmm3
    ADDSD   xmm3,xmm0/1/2 ;means any of xmm0, xmm1 or xmm2
    MOVSD   qword[eax],xmm3
Anyway, here is the complete cosine routine I have rewritten:
Code:
    align 16
    CosSSE8_2:
        mov     eax,[esp+4]               ;Load eax before using it
        mov     ecx,-16*6                 ;Load ecx before using it
        movsd   xmm0,qword[eax]
        movlhps xmm0,xmm0
        mulpd   xmm0,xmm0                 ;1 FP mul eliminated. And try to put..
        movdqa  xmm1,xmm0
        mulsd   xmm0,xmm0                 ;..some space between...
        movdqa  xmm2,xmm0
        mulpd   xmm1,xmm1                 ;...multiple multiplications
        movdqa  xmm3,dqword [CosOne]
    .CosLoop:
        mulpd   xmm0,dqword [CosTab+ecx+16*6]
        addpd   xmm3,xmm0
        mulpd   xmm2,xmm1
        movdqa  xmm0,xmm2
        add     ecx,16
        jnz     .CosLoop                  ;Maybe you prefer js or jnc
        ;SSE3 HADDPD
        movhlps xmm0,xmm3
        addsd   xmm3,xmm0
        movsd   qword [eax],xmm3
        ret     4

    align 16
    CosOne dq 1.0, 0.0
    CosTab dq 4.166666666666667e-2 , -5.000000000000000e-1 ;I guess 16 digits are enough for 64bit fp
           dq 2.480158730158730e-5 , -1.388888888888888e-3
           dq 2.087675698786810e-9 , -2.755731922398589e-7
           dq 4.779477332387385e-14, -1.147074559772972e-11
           dq 4.110317623312165e-19, -1.561920696858623e-16
           dq 1.611737571096118e-24, -8.896791392450573e-22
           dq 1.639944618534919e-30, -2.479596263224797e-27
r22 27 Jul 2005, 17:08
Shuffling the MOVDQA's squeaks out a few ticks faster ~5%
Putting one between the MOVLHPS and the MULPD. Also using 5 iterations; 6 doesn't seem to improve the accuracy of the results (this could be the Visual Studio 6 printf API I'm using, not sure).
Code:
    align 16
    ;fASM Community SSE cosine
    CosSSE8_2:                            ;;parallelism
        mov     eax,[esp+4]
        mov     ecx,-16*5                 ;10 iterations (5*2)
        MOVSD   xmm0,qword[eax]           ;\
        MOVLHPS xmm0,xmm0                 ;qword[eax] ;SSE3 MOV-DUP
        MOVDQA  xmm3,dqword[CosOneP]      ; 1.0 | 0.0
        MULPD   xmm0,xmm0                 ;x^2 x^2
        MOVDQA  xmm1,xmm0                 ; HI LO
        MULSD   xmm0,xmm0                 ; x^2 | x^4
        MOVDQA  xmm2,xmm0                 ;copy for scaling x^'s with xmm1
        MULPD   xmm1,xmm1                 ; x^4 | x^4
    .LP:
        MULPD   xmm0,dqword[CosTableP+16*5+ecx] ;x^? / (?)!
        ADDPD   xmm3,xmm0                 ;1 - + - + -...
        MULPD   xmm2,xmm1                 ;x^? | x^? TO x^(?+4) | x^(?+4)
        MOVDQA  xmm0,xmm2
        add     ecx,16
        js      .LP
        ;add high qword and low qword of xmm3 for result
        ;err no SSE3 horizontal add yet so, cheap work around
        ;SSE3 HADDPD
        MOVHLPS xmm1,xmm3                 ;,0eeh ;didn't think of it
        addsd   xmm3,xmm1                 ;suggested by: revolution
        MOVSD   qword[eax],xmm3
        ret     4

    align 16
    CosOneP dq 1.0, 0.0
    align 16
    ;Table of 1/4! | 1/2!, 1/8! | 1/6! ...
    CosTableP dq 0.041666666666666666666666666666667, -0.5
        dq 2.4801587301587301587301587301587e-5, -0.0013888888888888888888888888888889
        dq 2.0876756987868098979210090321201e-9, -2.7557319223985890652557319223986e-7
        dq 4.7794773323873852974382074911175e-14, -1.1470745597729724713851697978682e-11
        dq 4.1103176233121648584779906184361e-19, -1.5619206968586226462216364350057e-16
        dq 1.6117375710961183490487133048012e-24, -8.8967913924505732867488974425025e-22
        dq 3.2798892370698379101520417273121e-30, -2.479596263224797460074943545848e-27
Precision break down
I think it's safe to say we optimized this function to death. Can't wait to use the SSE3 opcodes and make a single precision one for game engines (put it up against a LUT). The setup cost might be a little high to get the xmm0 register filled with x^2 | x^4 | x^6 | x^8 to make the single precision version 2 times more parallel, but it would only have to loop twice. Can't wait to buy a 4400+ in Sept.
MCD 28 Jul 2005, 14:08
Just some more generalizations: you can use the same code with only different table values to calculate other Taylor-based functions, like sine, arctangent, the natural exponential and logarithm, and so on.
Another thing I thought of was to optimize the cosine algorithm by passing smaller values to it, as this equation allows:
Code:
    SQR(COS(a*0.5))*2 - 1 = COS(a)
Naturally you can nest it, but I doubt this is worth the effort in terms of precision/speed. The second idea I had was a Newton iteration for cosine (get one lookup-table value and refine it with only a few, quadratically converging iterations), but that one is too calculation-time intensive and thus completely useless:
Code:
    x_(n+1) = x_n + (arccos(x_n) - a)*SqRt(1 - x_n*x_n)
Needless to say, an arccos and a square root inside a Newton iteration are not THAT fast. Another approach is to combine a lookup table and use the Taylor-based function code only to calculate the rest, as the following formula may allow:
Code:
    COS(a) - COS(b) = -2 * SIN((a+b)*0.5) * SIN((a-b)*0.5)
    let b = Trunc(a)
    <=> COS(a) = COS(Trunc(a)) - SIN((a+Trunc(a))*0.5) * SIN((a-Trunc(a))*0.5) * 2
    <=> COS(a) = COSTab[Trunc(a)] - SIN((a+Trunc(a))*0.5) * SIN((a-Trunc(a))*0.5) * 2 ;1.)
In 1.) COSTab[Trunc(a)] is a value given by a lookup table (very fast to compute) and SIN((a-Trunc(a))*0.5) should also be very fast to compute since it should be very close to 0. The only problem is SIN((a+Trunc(a))*0.5), which should somehow be rearranged to also be very close to 0. Not an easy task. Anyone got an idea? Hope that wasn't too much math at a time
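The half-angle reduction above is easy to verify numerically. This is a sketch of the idea only, not thread code: each nesting level halves the argument fed to the series, and one multiply-add per level undoes the halving on the way out:

```python
import math

# Check of the identity SQR(COS(a*0.5))*2 - 1 = COS(a),
# i.e. cos(a) = 2*cos(a/2)^2 - 1, nested 'levels' times.
def cos_via_half_angle(a, levels=1, cos_impl=math.cos):
    a *= 0.5 ** levels            # shrink the argument for the series
    c = cos_impl(a)
    for _ in range(levels):
        c = 2.0 * c * c - 1.0     # undo one halving per level
    return c
```

As MCD suspects, the catch is precision: the recombination step multiplies the rounding error by up to 4 per level, so deep nesting trades series terms for accumulated error.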
r22 28 Jul 2005, 16:58
Ultrano used an FPU method for the sin function with a LUT http://board.win32asmcommunity.net/index.php?topic=21430.0
MCD 01 Aug 2005, 14:11
Okay, I finally got a very fast and effective method to calculate (double precision) cosine by combining a LUT containing 256 sine and 256 cosine values for 1 cycle (2*pi).
The principle is as follows:
Code:
    cos(a) = cos( a-round(a) + round(a) )
    using this trigonometric equation: cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)
    => cos(a-round(a)) * costab[round(a)] - sin(a-round(a)) * sintab[round(a)]
The idea behind all that is that both cos(a-round(a)) and sin(a-round(a)) should converge much faster than a plain cos(a) calculation. Furthermore costab[a] and sintab[a] should also be very fast to compute. And here goes the code:
Code:
    CosDblSSE:
        movlps   xmm0,[esi]               ;source value (angle in radians)
        mulsd    xmm0,[Rad2LUT]
        cvtsd2si eax,xmm0
        cvtsi2sd xmm1,eax
        and      eax,0FFh
        shl      eax,4
        subsd    xmm0,xmm1
        mulsd    xmm0,[LUT2Rad]
        movlhps  xmm0,xmm0
        mulsd    xmm0,xmm0
        movaps   xmm1,xmm0
        movaps   xmm2,xmm0
        movlhps  xmm2,xmm2
        mulpd    xmm1,xmm2
        mulsd    xmm0,[_0.5]
        addsd    xmm0,[_1.0]
        mulpd    xmm2,xmm1
        mulpd    xmm1,[CosSinCoeff]
        addpd    xmm0,xmm1
        mulpd    xmm2,[CosSinCoeff+16]
        addpd    xmm0,xmm2
        mulpd    xmm0,[CosSinLUT+eax]
        movhlps  xmm1,xmm0
        subsd    xmm0,xmm1
        movlps   [edi],xmm0               ;destination value
        ret

    align 16
    Rad2LUT dq 4.074366543152521e1  ;256/(2*pi)
    LUT2Rad dq 2.454369260617026e-2 ;2*pi/256
    _0.5    dq 0.5
    _1.0    dq 1.0
    CosSinCoeff: ;COS, -SIN
        dq  4.166666666666667e-2, -1.666666666666667e-1 ; 1/4!, -1/3!
        dq -1.388888888888889e-3,  8.333333333333333e-3 ;-1/6!,  1/5!
    CosSinLUT: rb 256*8*2 ;Has to be set up before first use
    ;The format of the LUT is as follows:
    ;[Double cosine(0*(2pi/256))] [Double sine(0*(2pi/256))]
    ; 0                           8
    ;[Double cosine(1*(2pi/256))] [Double sine(1*(2pi/256))]
    ; 10h                         18h
    ; . . . . . .
The code for initializing the LUT is not included, but this can be done with a simple FSINCOS. Unfortunately I don't have a CPU that supports SSE2/SSE3 instructions, so I was unable to test this procedure.
Hopefully I have this SSE1 single precision equivalent cosine procedure right:
Code:
    CosSnglSSE:
        movss    xmm0,[esi]
        mulss    xmm0,[Rad2LUT]
        cvtss2si eax,xmm0                 ;scalar-single forms (the post had cvtsd2si/
        cvtsi2ss xmm1,eax                 ; cvtsi2sd, which are SSE2 double conversions)
        and      eax,0FFh
        shl      eax,4
        subss    xmm0,xmm1
        mulss    xmm0,[LUT2Rad]
        movaps   xmm1,xmm0
        mulss    xmm1,xmm1
        movlhps  xmm1,xmm1
        mulss    xmm1,xmm0
        movhps   xmm0,qword [_1.0]
        mulps    xmm1,[SinCosCoeff]
        addps    xmm0,xmm1
        mulps    xmm0,[SinCosLUT+eax]     ; 1.)
        movhlps  xmm1,xmm0
        subss    xmm1,xmm0                ; 1.)
        movss    [edi],xmm1
        ret

    align 16
    Rad2LUT dd 40.743665    ;256/(2*pi)
    LUT2Rad dd 0.024543692  ;2*pi/256
    _1.0    dd 1.0,0
    SinCosCoeff:
        dd -0.16666667,0, -0.5,0 ;-1/3!, -1/2!
    SinCosLUT: rb 256*4*4 ;Has to be set up before first use
    ;The format of the LUT is as follows:
    ;[Single sine(0*(2pi/256))] [0] [Single cosine(0*(2pi/256))] [0]
    ; 0                          4   8                           0Ch
    ;[Single sine(1*(2pi/256))] [0] [Single cosine(1*(2pi/256))] [0]
    ; 10h                        14h 18h                         1Ch
    ; . . . . . .
Tests showed that this works like it should, except for angles close to pi+2*pi*n, where the result loses several digits of precision. That's neither because the Taylor cosine/sine calculations have too few steps nor because the LUT is too small; the error happens at the very end of the calculation, where all 4 intermediate results are combined at the 1.) lines following this formula: cos(a)*cos(b) - sin(a)*sin(b). Unfortunately, these errors are caused by the nature of floating point numbers themselves, and since we can't factor this expression, there is nothing we can do except compute those last steps in double precision if we really want such precise results.
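MCD's reduction scheme can be modelled compactly in Python. This is a sketch of the algorithm, not a translation of the asm: split a = r + b with b the nearest table angle, then apply cos(r + b) = cos(r)*cos(b) - sin(r)*sin(b), where |r| <= pi/256 so very short series suffice for cos(r) and sin(r):

```python
import math

TWO_PI = 2.0 * math.pi
N = 256  # table entries per full cycle, as in the routine above
COS_TAB = [math.cos(TWO_PI * i / N) for i in range(N)]
SIN_TAB = [math.sin(TWO_PI * i / N) for i in range(N)]

def cos_lut_residual(a):
    t = a * N / TWO_PI                               # like mulsd xmm0,[Rad2LUT]
    n = round(t)                                     # like cvtsd2si (round to nearest)
    r = (t - n) * TWO_PI / N                         # residual angle, |r| <= pi/256
    r2 = r * r
    cos_r = 1.0 - r2 / 2.0 + r2 * r2 / 24.0          # 1 - r^2/2! + r^4/4!
    sin_r = r * (1.0 - r2 / 6.0 + r2 * r2 / 120.0)   # r - r^3/3! + r^5/5!
    idx = n & (N - 1)                                # like 'and eax,0FFh'
    return cos_r * COS_TAB[idx] - sin_r * SIN_TAB[idx]
```

Because the residual is so small, two or three series terms per half already give near-double accuracy, which is the whole payoff of the LUT split.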
revolution 01 Aug 2005, 15:48
I get the following timing results with my test code:
FCOS: 1.000
r22 last suggestion: 0.6341
MCD 256 LUT: 0.6442 (accurate to 5 decimal places)
MCD: The result from your code only has 5 digits of accuracy, is that deliberate or my mistake? I used SSE2 and initialised the LUT with the following code:
Code:
    init_CosSinLUT:
        mov  edi,CosSinLUT
        fld  qword[_2div256] ;2/256
        fldpi
        fmulp
        fldz
    .a:
        fld  st0
        fsincos
        fstp qword[edi+0]
        fstp qword[edi+8]
        fadd st0,st1
        add  edi,16
        cmp  edi,CosSinLUT+16*256
        jb   .a
        fucompp
        ret
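The FSINCOS loop above can be mirrored in a few lines of Python, which also documents the table layout (one cos/sin pair per 16-byte slot for angles i*2*pi/256). A model for reference, not the FPU code itself:

```python
import math

# Model of init_CosSinLUT: each table slot holds (cos, sin) of i*2*pi/256,
# matching the [cosine | sine] layout MCD documented for CosSinLUT.
def init_cos_sin_lut(n=256):
    step = 2.0 * math.pi / n      # fldpi scaled by 2/256 in the FPU version
    return [(math.cos(i * step), math.sin(i * step)) for i in range(n)]
```

Note FSINCOS leaves the cosine in ST(0) and the sine in ST(1), so the two FSTPs above store cosine at offset 0 and sine at offset 8, in agreement with this layout.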
MCD 02 Aug 2005, 11:39
Your code seems to be correct, as does mine. My code seems to have precision difficulties with input angles close to pi + n*2*pi, where n is any integer. So please tell me the value you tested the code with.
Well, the issue of speed is also clear to me. My code is simply not that optimized yet; I guess it's still slow because of lots of dependencies on previous instruction results. Someone should fix that.
edit: I've just tested my code (on another computer with SSE2) with the worst-case angle of pi, and my cos procedure still had 8 correct decimal digits. Values further away from pi even have full double precision.
Madis731 02 Aug 2005, 13:28
Great page on square-root if you are looking for it:
http://www.azillionmonkeys.com/qed/sqroot.html There are integer and float code samples, in C and in ASM. The page has changed since I last checked it, but a quick look tells me the algorithms have remained. Of course a lot of additions have been made.
revolution 02 Aug 2005, 14:29
MCD: I was testing with the neutral COS value +0.739085133215160642 (the fixed point of cosine, where cos(x) = x). When I use your code it settles at +0.739088549289849461.
MCD 04 Aug 2005, 11:55
Thanks for the testing. I missed a really stupid sign bug.
The _0.5 constant
Code:
    _0.5 dq 0.5
should be changed to
Code:
    _Neg0.5 dq -0.5
These sign bugs are pretty nerve-wracking. With some code rearrangements (maybe a bit faster/slower on your CPU), this should be it:
Code:
    CosDblSSE:
        movlps   xmm0,[esi]               ;source value (angle in radians)
        mulsd    xmm0,[Rad2LUT]
        cvtsd2si eax,xmm0
        cvtsi2sd xmm1,eax
        subsd    xmm0,xmm1
        mulsd    xmm0,[LUT2Rad]
        movlhps  xmm0,xmm0
        mulsd    xmm0,xmm0
        movaps   xmm1,xmm0
        movaps   xmm2,xmm0
        movlhps  xmm2,xmm2
        mulpd    xmm1,xmm2
        and      eax,0FFh
        mulpd    xmm2,xmm1
        shl      eax,4
        mulpd    xmm1,[CosSinCoeff]
        mulpd    xmm2,[CosSinCoeff+16]
        addsd    xmm0,[_Neg2.0]
        mulsd    xmm0,[_Neg0.5]
        addpd    xmm0,xmm1
        addpd    xmm0,xmm2
        mulpd    xmm0,[CosSinLUT+eax]
        movhlps  xmm1,xmm0
        subsd    xmm0,xmm1
        movlps   [edi],xmm0               ;destination value
        ret

    align 16
    Rad2LUT dq 4.074366543152521e1  ;256/(2*pi)
    LUT2Rad dq 2.454369260617026e-2 ;2*pi/256
    CosSinCoeff: ;COS, -SIN
        dq  4.166666666666667e-2, -1.666666666666667e-1 ; 1/4!, -1/3!
        dq -1.388888888888889e-3,  8.333333333333333e-3 ;-1/6!,  1/5!
    _Neg2.0 dq -2.0 ;just an optimization: (x^2 - 2)*(-0.5) = 1 - x^2/2
    _Neg0.5 dq -0.5 ;fixed
    CosSinLUT: rb 256*8*2 ;Has to be set up before first use
    ;The format of the LUT is as follows:
    ;[Double cosine(0*(2pi/256))] [Double sine(0*(2pi/256))]
    ; 0                           8
    ;[Double cosine(1*(2pi/256))] [Double sine(1*(2pi/256))]
    ; 10h                         18h
    ; . . . . . .
Now it seems to work flawlessly, still except for values close to pi + 2*n*pi. Maybe precaching [CosSinLUT+eax] with prefetch instructions would speed this up a little bit.
Another, last big speed-up can be achieved with this, if you need to calculate both sine and cosine:
Code:
    CosSinDblSSE:
        movlps   xmm0,[esi]               ;source value (angle in radians)
        mulsd    xmm0,[Rad2LUT]
        cvtsd2si eax,xmm0
        cvtsi2sd xmm1,eax
        subsd    xmm0,xmm1
        mulsd    xmm0,[LUT2Rad]
        movlhps  xmm0,xmm0
        mulsd    xmm0,xmm0
        movaps   xmm1,xmm0
        movaps   xmm2,xmm0
        movlhps  xmm2,xmm2
        mulpd    xmm1,xmm2
        and      eax,0FFh
        mulpd    xmm2,xmm1
        shl      eax,4
        mulpd    xmm1,[CosSinCoeff]
        mulpd    xmm2,[CosSinCoeff+16]
        addsd    xmm0,[_Neg2.0]
        mulsd    xmm0,[_Neg0.5]
        addpd    xmm0,xmm1
        addpd    xmm0,xmm2
        movlhps  xmm1,xmm0
        movhlps  xmm1,xmm0
        mulpd    xmm0,[CosSinLUT+eax]
        mulpd    xmm1,[CosSinLUT+eax]
        movhlps  xmm2,xmm0
        movhlps  xmm3,xmm1
        subsd    xmm0,xmm2
        addsd    xmm1,xmm3
        movlhps  xmm0,xmm1
        movaps   [edi],xmm0               ;destination value: cos(angle), sin(angle)
        ret
Copyright © 1999-2025, Tomasz Grysztar.