flat assembler
Message board for the users of flat assembler.
Index
> Main > FAST COSINE DOUBLE PRECISION FP Goto page Previous 1, 2, 3 Next 
Author 

revolution
I found for my system that the following variant of MCD's code is the fastest:
Code: movlpd xmm0,[esi] ;source value (angle in radians) mulsd xmm0,[Rad2LUT] cvtsd2si eax,xmm0 cvtsi2sd xmm1,eax subsd xmm0,xmm1 mulsd xmm0,[LUT2Rad] movlhps xmm0,xmm0 mulsd xmm0,xmm0 movapd xmm1,xmm0 movapd xmm2,xmm0 movlhps xmm2,xmm2 mulpd xmm1,xmm2 and eax,0FFh mulpd xmm2,xmm1 shl eax,4 mulpd xmm1,[CosSinCoeff] mulpd xmm2,[CosSinCoeff+16] addsd xmm0,[_Neg2.0] mulsd xmm0,[_Neg0.5] addpd xmm0,xmm1 addpd xmm0,xmm2 mulpd xmm0,[CosSinLUT+eax] movhlps xmm1,xmm0 subsd xmm0,xmm1 movlpd [edi],xmm0 ;destination value 

07 Aug 2005, 07:27 

MCD
Good. But I was surprised that "MOVAPD" here seems to be faster than "MOVAPS", although "MOVAPD" is 1 byte longer and both are documented to perform exactly the same. Maybe it's some internal optimization. Or more likely, it's just a code align issue.
Even If the idea of combining a look up table and a taylor iteration with a faster converting, smaller remainder for trigonometrical calculations based upon the addion theorem is very good, it's still doesn't do it's job better. Firstly, for certain input angles, it's getting precisions problems (may be perhaps solved). Secondly a quiet heavy data sizzling and converting overhead is required, which strips down the acquired performance gain to null. Maybe more precise calculations, like arbitrary number calculations would benefit from it. While I'm at talking about mathcalculations, I must admit that I was about writing my own arbitrary precision library (naturally with Fasm) before I got stuck optimizing Fasm itself. I just have to wait until my Fasm version reresembles to something usable, and this may take several weeks to some months. It's mostly unclear at the moment. Well, I know there are some excellent arbitrary precision libraries out there, but I wanted one that fits all of the following conditions: should only require a 386, but speed relevant parts should be optimized with MMX/SSE (when such a CPU is present) => library requires a CPU recognization good organized code, like Fasm itself =>only a few includefiles only use integer calculations basic data types should be integerarrays and fractionalarrays(constitues of 2 integer arrays). Later you get them into structures with sign, error codes, special fields (infinity) a.s.o. code should be broken down into several parts. The lowest part contains binary, shift/rotate,copy/clear,reverse/exchange,interleave/uninterleave,add/sub,scan/cmp, simple scalar mul, cross multiply a.s.o. operations simple arrays with no error detection in their routines itself, no special fields, nor sign. Also the lowest part should have string/number array and number array/string converting routines for displaying values (simple base convertings). At least the lowest part should have it's routines both in and outofplace, unless it's really impossible. The next higher part uses fractional number arrays for higher level calculations like newton and taylor based calcualtions: divide, reciproke, logarythm, exponent, sine/cosine, roots, powers...thus this level requires error detection at some points, which stores it's results into some special fields into the structures. Another special field for sign (not 2's complement numbers), (multiple) infinity... should be given. This level should be based upon lowest part for it's calculations. speed optimized, not code size optimized => use features, which are quiet impossible with HLLs, as for example, using the carry flag for arbitrary long adds/subs. Also may use bigger data elements (not bytewise, as some do). I had planned to use a general 16byte boundary, because of SSE. 

08 Aug 2005, 13:47 

revolution
Quote: But I was surprised that "MOVAPD" here seems to be faster than "MOVAPS" Quote: ... However, because of the type mismatch between the operand data type and the instruction data type, a latency penalty will be incurred due to implementations of the instructions at the microarchitecture level. MCD: If you are going to use VERY big numbers you will need to implement FFT for your multiplications. Using only integer calculations will be slower. Most arbitrary precision libraries are for integer only. Things like LL, RSA and ECC only need integer calculations. But for COS, SIN, LOG etc. I can't think of a use for very high precision (except perhaps to break a world record for most accurate SQRT(2) or something similar) 

09 Aug 2005, 02:40 

MCD
revolution wrote:
Thx. I haven't read that chapter in detail yet. _________________ MCD  the inevitable return of the Mad Computer Doggy __/ .+~ .  

09 Aug 2005, 14:25 

Garthower
2MCD: Do you can post along procedure to calculate sin? I not so understand the formula of calculation that most to choose the need code.


23 Feb 2007, 12:22 

MCD
Garthower wrote: 2MCD: Do you can post along procedure to calculate sin? I not so understand the formula of calculation that most to choose the need code. Sorry, but I have a little trouble in understanding what you actually want from me. do you want a procedure that caculates the sine (instead of codsine) => my procedure actually calculates both sine AND cosine do you want some math background info how those algorithms work? If so, I will try to post something if you answer that that is what you wanted. 

24 Feb 2007, 05:07 

Garthower
Quote: do you want a procedure that caculates the sine (instead of codsine) => my procedure actually calculates both sine AND cosine If I am not mistaken, you have put here two procedures  calculation cosine and analogue of a FPUcommand fsincos. But separate procedure of calculation only a sine is necessary to me also. Quote: do you want some math background info how those algorithms work? If so, I will try to post something if you answer that that is what you wanted. Yes, it for me is important, to understand how it works. 

24 Feb 2007, 11:58 

Helle
@Garthower: Sine:
Code: format PE CONSOLE 4.0 entry start Include 'win32a.inc' section ".code" code readable writeable executable string db "Sine value from Rad approximative:",0 szEnv db "%s %1.15f",13,10,0 align 16 RF1 dq 8.333333333333333333e3 ; 1/5! dq 1.666666666666666667e1 ;1/3! dq 2.755731922398589065e6 ; 1/9! dq 1.984126984126984127e4 ;1/7! dq 1.605904383682161460e10 ; 1/13! dq 2.505210838544171878e8 ;1/11! Rad dq 0.123456789 Result dq ? start: ; Sin(x) = x  (x^3)/3! + (x^5)/5!  (x^7)/7! + (x^9)/9!  (x^11)/11! + (x^13)/13!  + ... ; Rad = 1.57079632679 to +1.57079632679 (90° to +90°) or more iterations for more precision! ; SSE2 movsd xmm0,[Rad] ;1 = x 1 = QuadWord1 2 = QuadWord2 movhpd xmm0,[Rad] ;1 and 2 = x ;SSE3: movddup xmm0,[Rad] movapd xmm1,xmm0 ;XMM1 = XMM0 movapd xmm5,xmm0 mulpd xmm0,xmm1 ;XMM0: 1 = x^2 2 = x^2 movapd xmm1,xmm0 ;XMM1: 1 = x^2 2 = x^2 mulsd xmm0,xmm1 ;XMM0: 1 = x^4 2 = x^2 mulpd xmm1,xmm1 ;XMM1: 1 = x^4 2 = x^4 mulpd xmm0,xmm5 ;XMM0: 1 = x^5 2 = x^3 movq xmm3,[Rad] ;XMM3: 1 = x 2 = 0.0 movapd xmm2,xmm0 ;XMM2 = XMM0 mov ecx,48 ;16 Bytes ReadIn, 3 Loops lea esi,[RF1] @@: movupd xmm4,[esi+48+ecx] mulpd xmm0,xmm4 addpd xmm3,xmm0 mulpd xmm2,xmm1 movapd xmm0,xmm2 add ecx,16 jnz @r movhlps xmm4,xmm3 addsd xmm4,xmm3 movsd qword[Result],xmm4 invoke printf,szEnv, string, dword[Result], dword[Result+4] invoke getchar invoke ExitProcess,0 section '.idata' import data readable writeable library kernel32,'kernel32.dll',\ msvcrt,'msvcrt.dll' import kernel32,\ ExitProcess,'ExitProcess' import msvcrt,\ getchar,'getchar',\ printf,'printf' Gruss Helle 

24 Feb 2007, 17:41 

Garthower
Hi Helle!
In your source in comments you wrote: Quote: Rad = 1.57079632679 to +1.57079632679 (90° to +90°) or more iterations for more precision! On how many it's need to make iterations for corners, for example, in a range 180° to +180°? By what principle to select quantity of iterations? 

03 Mar 2007, 13:32 

Helle
Hello Garthower,
the quantity of iterations is the precision which you want. I probe with a maxrange for a given program with the precision which I need. Is the SSEspeed better as FPU  o.K., if not, I take FPU. For instance, for monitordisplaying (displaypixel in a game, ballistic in a shooter) are 78 significant digits enough. This is faster with SSEiterations as the FPU! BTW: The (appr.)FPUprecision is 7 significant digits for Single (32bit) and 15 for Double (64bit). Example for a test with Sine(180°), max.12 iterations: Code: format PE CONSOLE 4.0 entry start Include 'win32a.inc' section '.code' code readable writeable executable string db 'Sine value from Pi approximative (Rad)',0 szEnv db '%s %2.1d Iterations: %5.15f ',13,10,0 align 16 RF1 dq 8.333333333333333333e3 ; 1/5! 2.Iteration dq 1.666666666666666667e1 ;1/3! 1.Iteration dq 2.755731922398589065e6 ; 1/9! 4.Iteration dq 1.984126984126984127e4 ;1/7! 3.Iteration dq 1.605904383682161460e10 ; 1/13! 6.Iteration dq 2.505210838544171878e8 ;1/11! 5.Iteration dq 2.811457254345520763e15 ; 1/17! 8.Iteration dq 7.647163731819816476e13 ;1/15! 7.Iteration dq 1.957294106339126123e20 ; 1/21! 10.Iteration dq 8.220635246624329717e18 ;1/19! 9.Iteration dq 6.446950284384473396e26 ; 1/25! 12.Iteration dq 3.868170170630684038e23 ;1/23! 11.Iteration Rad dq 3.1415926535897932384626433832795 ;Pi, Sin(Pi)=0.0 (Rad) Result dq ? Iter dd 2 ;minimum start: ; Sin(x) = x  (x^3)/3! + (x^5)/5!  (x^7)/7! + (x^9)/9!  (x^11)/11! + (x^13)/13!  + ... ; SSE2 ; TestExample for x=Pi L1: movsd xmm0,[Rad] ;1 = x 1 = QuadWord1 2 = QuadWord2 movhpd xmm0,[Rad] ;1 and 2 = x movapd xmm5,xmm0 ;save XMM0 mulpd xmm0,xmm0 ;XMM0: 1 = x^2 2 = x^2 movapd xmm1,xmm0 ;XMM1: 1 = x^2 2 = x^2 mulsd xmm0,xmm1 ;XMM0: 1 = x^4 2 = x^2 mulpd xmm1,xmm1 ;XMM1: 1 = x^4 2 = x^4 mulpd xmm0,xmm5 ;XMM0: 1 = x^5 2 = x^3 movq xmm3,[Rad] ;XMM3: 1 = x 2 = 0.0 movapd xmm2,xmm0 ;XMM2 = XMM0 mov ecx,[Iter] shl ecx,3 ;*16 Bytes / 2 Iterations per Loop =8 lea esi,[RF1] add esi,ecx neg ecx @@: movupd xmm4,[esi+ecx] mulpd xmm0,xmm4 addpd xmm3,xmm0 mulpd xmm2,xmm1 movapd xmm0,xmm2 add ecx,16 jnz @r movhlps xmm4,xmm3 addsd xmm4,xmm3 movsd qword[Result],xmm4 invoke printf,szEnv, string, dword[Iter], dword[Result], dword[Result+4] add [Iter],2 cmp [Iter],12 ;MaxIter for this Example jbe L1 invoke getchar invoke ExitProcess,0 section '.idata' import data readable writeable library kernel32,'kernel32.dll',\ msvcrt,'msvcrt.dll' import kernel32,\ ExitProcess,'ExitProcess' import msvcrt,\ getchar,'getchar',\ printf,'printf' Gruss Helle 

03 Mar 2007, 18:19 

Garthower
But MCD's code for cosine calculating is universal for all range numbers without any floating iterations...


04 Mar 2007, 14:27 

Goplat
I've found that trying to calculate the sine/cosine of a single number using parallel SSE is actually slower than doing the same approximation using purely scalar SSE/FPU code. If it's possible to parallelize at a higher level (calculate two cosines at once) that would be best.


04 Mar 2007, 20:09 

asmfan
Approximations of sin, cos, etc. are based on Maclaurin's (Taylor's) series, which is basically infinite ;P The more members of serie you take, the more accurate will be result.


04 Mar 2007, 21:10 

MCD
I'm just dropping by to complete an older statement which I blamefully left unexplained for quiet some time now
MCD wrote:
Actually, my algorithm is numerically unstable for values close to pi/2 + k*pi. This is due to a loss of significance by the subtraction which is required to split the angle into a bigger, divided part (for the lookup table) and remainder part(for the polynomialseries approximation): (see http://en.wikipedia.org/wiki/Loss_of_significance if you want to know something about loss of significance.) MCD wrote:
I actually wrote _polynomial_series and not _Taylor_, because the series are not really the Tayloseries, but a better approximation. I used a kind of minimummaxDelta approximation with the one parameter fixed at 0 and only the other one left to be optimized (I remeber that I have found out these values by hand with a nested intervall algorithm). If I should ever recalculate these values, I think it would be more wise to chose some Tschebyscheffinterpolation with the 2 point being variable(and thus even more precise) instead of some handcrafted approximation. I don't think I'm going to rewrite this code soon due to lack of time. _________________ MCD  the inevitable return of the Mad Computer Doggy __/ .+~ .  

17 Jan 2008, 16:22 

edfed
how is the algorythm to obtain cosine with only +  * and / ?
anybody know? same question for square root, logarythm and neperian logarythm... ouch!! 

18 Jan 2008, 00:44 

asmfan
As stated above Taylor series on wikipedia.
You can easily write an iterative procedure to count sin,cos etc. with needed precision with only * and + and compare to needed loss error on each iteration > F(n+1)  F(n) < Error this could be the condition of exit from iteration loop. _________________ Any offers? 

18 Jan 2008, 07:16 

Madis731
It seems that you often miss the crucial point in nonARM architecture optimizations. When on a x86, then instructions with register,memory operands are faster than register,register because of latencies. There have been many occasions and proofs on these boards where this holds true.
http://board.flatassembler.net/topic.php?t=5122&postdays=0&postorder=asc&start=120 Xorpd! wrote:
I see many tries to cut down the instruction count, while not seeing the most obvious. When you load the two parts of an XMM register separately, its bad, but when you load something into it and then shuffle, its even worst (as discovered). The problem is in the latency which comes from the dependency. The most straightforward and also valid way is to shuffle it straight from the memory and onthefly. The sample I'm going to analyze is this: Code: movsd xmm0,[Rad] ;1 = x 1 = QuadWord1 2 = QuadWord2 movhpd xmm0,[Rad] ;1 and 2 = x movapd xmm5,xmm0 ;save XMM0 mulpd xmm0,xmm0 ;XMM0: 1 = x^2 2 = x^2 Note: All sources to the amount of latency come from Agner Fog's manual instruction_tables.pdf and the timings are for Core 2 There's a scalar load, then a LATENCY because xmm0 is not ready writing. There's movhpd then which completes the xmm0...and then a LATENCY again. xmm5 is waiting for operations to complete on the xmm0 before it can read the results. Multiply itself has a long latency. This sequence of instructions therefore can never be completed under 10 clocks because there's no earlyout that gives other instructions the ability to start in the same clock. So coming from 24 down to 22 bytes (32bit mode): Code: pshufd xmm0,[Rad],01000100b ; xmm0 loaded with 2 QWORDs pshufd xmm5,[Rad],01000100b ; xmm5 loaded with 2 QWORDs mulpd xmm0,xmm0 ; XMM0: 1 = x^2 2 = x^2 Its 2 bytes shorter though it seems to be a waste using memory operations twice. The trick here is that pshufd with registers has a 3 clock additional latency, while with memory operand, it doesn't. They can't execute in parallel because they both use port 2 (which Core 2 has only one), but the multiply can start as early as on the third clock. Of course if you need to fit it in less space then use registers for memory reference: Code: ; 14 bytes, don't use esp because it means one extra byte pshufd xmm0,[eax],01000100b ; xmm0 loaded with 2 QWORDs pshufd xmm5,[eax],01000100b ; xmm5 loaded with 2 QWORDs mulpd xmm0,xmm0 ; XMM0: 1 = x^2 2 = x^2 Now the most crucial point remains. The mulpd with register operands is claimed to have 5 (FIVE!!!) clock latency when with a memory operand it only has ... none What we should do is change it to mulpd xmm0,[mem] and this can be done by having our value ready in memory. Some tweaks here and there and a few optimizations later what we have now is this: Code: ;17 bytes mov eax,Rad ;eax is the reference to Rad with 2 QWORDs movapd xmm0,[eax] ;xmm0 loaded with 2 QWORDs movapd xmm5,xmm0 ;xmm5 ... mulpd xmm0,[eax] ;xmm0*xmm0 Rad: dq 3.1415926535897932384626433832795 dq 3.1415926535897932384626433832795 Here I sacrificed 8 bytes of memory for another Pi but from 24 to 17 bytes its 7 bytes less code. We lost only 1 byte The speed? mov eax,mem can be and should be done only once before the loop. Because it has a oneclock latency, the first loop can begin from the second clock. The other iterations have eax initialized already and if we assume clock starting from the L1: (the first SSE instruction) then we get first load in the first clock. Here is why I reverted back to reg,reg version of movapd. When all these three instructions were reg,mem then it would take 3 clocks because of the heavy load on port 2. Now that movapd reg,reg takes that off, the multiply can begin in parallel with the load in the second clock. EDIT: Lets trust the reordering unit. Movapd should choose 1 or 5 to get multiply to port 0 as requested. EDIT: The result is that we came from the roughly 10+ clocks down to 2 or 3 depending on outoforder unit! Last edited by Madis731 on 18 Jan 2008, 21:28; edited 2 times in total 

18 Jan 2008, 09:15 

revolution
Don't forget that the C2 can decode up to 4 instructions in one clock only if they are in the same 16byte aligned memory. Your last piece of code:
Code: ;17 bytes mov eax,Rad ;eax is the reference to Rad with 2 QWORDs movapd xmm0,[eax] ;xmm0 loaded with 2 QWORDs mulpd xmm0,[eax] ;xmm0*xmm0 movapd xmm5,xmm0 ;xmm5 ... Rad: dq 3.1415926535897932384626433832795 dq 3.1415926535897932384626433832795 

18 Jan 2008, 11:53 

Madis731
Oh, damn, I was so focused on optimizing that I forgot about functionality :S I'll try to fix the previous post
Btw, I tried to give a go on the MCD's bestperforming code, but I got stuck. I'll think about it later, maybe I can get rid of all the dependencies: Code: movsd xmm0,[esi] ; faster than movlpd mulsd xmm0,[Rad2LUT] ; dependency cvtsd2si eax,xmm0 ; dep. cvtsi2sd xmm1,eax ; dep. subsd xmm0,xmm1 ; dep. mulsd xmm0,[LUT2Rad] ; dep. movlhps xmm0,xmm0 ; dep. mulsd xmm0,xmm0 ; dep. movapd xmm1,xmm0 ; dep. didn't know how to fix this movapd xmm2,xmm0 movlhps xmm0,xmm0 ; don't use the written reg immediately addsd xmm2,[_Neg2.0] ; interleave mulpd xmm1,xmm0 mulsd xmm2,[_Neg0.5] ; intl. and eax,0FFh mulpd xmm0,xmm1 shl eax,4 mulpd xmm1,[CosSinCoeff] mulpd xmm0,[CosSinCoeff+16] addpd xmm2,xmm1 addpd xmm2,xmm0 mulpd xmm2,[CosSinLUT+eax] ; dep. movhlps xmm1,xmm2 ; dep. subsd xmm2,xmm1 ; dep. movlpd [edi],xmm2 ;destination value One thing I was thinking about was the [esi] at the beginning. Maybe we can shuffle it into 2 QWORDs and change first 6 instructions accordingly. Then we can CVT using memory (remember, less latency) and remove the 7th instruction (MOVLHPS). 

18 Jan 2008, 12:03 

Goto page Previous 1, 2, 3 Next < Last Thread  Next Thread > 
Forum Rules:

Copyright © 19992020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.