flat assembler
Message board for the users of flat assembler.

Index > Main > FAST COSINE DOUBLE PRECISION FP

Goto page Previous  1, 2, 3  Next
Author
Thread Post new topic Reply to topic
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20454
Location: In your JS exploiting you and your system
revolution 04 Aug 2005, 15:56
Quote:
Code:
        movlhps xmm1,xmm0 
        movhlps xmm1,xmm0     

Can be:
Code:
pshufd xmm1,xmm0,4eh    
also
Code:
movaps ...    
might work faster with
Code:
movapd ...    
But, not sure, haven't tested it, just a quick thought.
Post 04 Aug 2005, 15:56
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20454
Location: In your JS exploiting you and your system
revolution 07 Aug 2005, 07:27
I found for my system that the following variant of MCD's code is the fastest:
Code:
        movlpd  xmm0,[esi]      ;source value (angle in radians)
        mulsd   xmm0,[Rad2LUT]
        cvtsd2si eax,xmm0
        cvtsi2sd xmm1,eax
        subsd   xmm0,xmm1
        mulsd   xmm0,[LUT2Rad]
        movlhps xmm0,xmm0
        mulsd   xmm0,xmm0
        movapd  xmm1,xmm0
        movapd  xmm2,xmm0
        movlhps xmm2,xmm2
        mulpd   xmm1,xmm2
        and     eax,0FFh
        mulpd   xmm2,xmm1
        shl     eax,4
        mulpd   xmm1,[CosSinCoeff]
        mulpd   xmm2,[CosSinCoeff+16]
        addsd   xmm0,[_Neg2.0]
        mulsd   xmm0,[_Neg0.5]
        addpd   xmm0,xmm1
        addpd   xmm0,xmm2
        mulpd   xmm0,[CosSinLUT+eax]
        movhlps xmm1,xmm0
        subsd   xmm0,xmm1
        movlpd  [edi],xmm0      ;destination value    
The relative run time on my system is 0.6164 which makes the fastest (but only by a little) of all algorithms presented here.
Post 07 Aug 2005, 07:27
View user's profile Send private message Visit poster's website Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 08 Aug 2005, 13:47
Good. But I was surprised that "MOVAPD" here seems to be faster than "MOVAPS", although "MOVAPD" is 1 byte longer and both are documented to perform exactly the same. Maybe it's some internal optimization. Or more likely, it's just a code align issue.

Even If the idea of combining a look up table and a taylor iteration with a faster converting, smaller remainder for trigonometrical calculations based upon the addion theorem is very good, it's still doesn't do it's job better.

Firstly, for certain input angles, it's getting precisions problems (may be perhaps solved). Secondly a quiet heavy data sizzling and converting overhead is required, which strips down the acquired performance gain to null. Maybe more precise calculations, like arbitrary number calculations would benefit from it.

While I'm at talking about math-calculations, I must admit that I was about writing my own arbitrary precision library (naturally with Fasm) before I got stuck optimizing Fasm itself. I just have to wait until my Fasm version reresembles to something usable, and this may take several weeks to some months. It's mostly unclear at the moment.

Well, I know there are some excellent arbitrary precision libraries out there, but I wanted one that fits all of the following conditions:
-should only require a 386, but speed relevant parts should be optimized with MMX/SSE (when such a CPU is present)
=> library requires a CPU recognization
-good organized code, like Fasm itself
=>only a few include-files
-only use integer calculations
-basic data types should be integer-arrays and fractional-arrays(constitues of 2 integer arrays). Later you get them into structures with sign, error codes, special fields (infinity) a.s.o.
-code should be broken down into several parts.
The lowest part contains binary, shift/rotate,copy/clear,reverse/exchange,interleave/uninterleave,add/sub,scan/cmp, simple scalar mul, cross multiply a.s.o. operations simple arrays with no error detection in their routines itself, no special fields, nor sign.
Also the lowest part should have string/number array and number array/string converting routines for displaying values (simple base convertings). At least the lowest part should have it's routines both in- and out-of-place, unless it's really impossible.
The next higher part uses fractional number arrays for higher level calculations like newton and taylor based calcualtions: divide, reciproke, logarythm, exponent, sine/cosine, roots, powers...thus this level requires error detection at some points, which stores it's results into some special fields into the structures. Another special field for sign (not 2's complement numbers), (multiple) infinity... should be given. This level should be based upon lowest part for it's calculations.
-speed optimized, not code size optimized
=> use features, which are quiet impossible with HLLs, as for example, using the carry flag for arbitrary long adds/subs.
Also may use bigger data elements (not bytewise, as some do). I had planned to use a general 16byte boundary, because of SSE.
Post 08 Aug 2005, 13:47
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20454
Location: In your JS exploiting you and your system
revolution 09 Aug 2005, 02:40
Quote:
But I was surprised that "MOVAPD" here seems to be faster than "MOVAPS"
This is from the Intel manual:
Quote:
... However, because of the type mismatch between the operand data type and the instruction data type, a latency penalty will be incurred due to implementations of the instructions at the microarchitecture level.

Latency penalties can also be incurred by using move instructions of the wrong type. For example, MOVAPS and MOVAPD can both be used to move a packed single-precision operand from memory to an XMM register. However, if MOVAPD is used, a latency penalty will be incurred when a correctly typed instruction attempts to use the data in the register.

Note that these latency penalties are not incurred when moving data from XMM registers to memory.
It seems that internally the processor keeps a flag to indicate what type of data to expect (integer, PS or PD) for the next operation.

MCD: If you are going to use VERY big numbers you will need to implement FFT for your multiplications. Using only integer calculations will be slower.

Most arbitrary precision libraries are for integer only. Things like LL, RSA and ECC only need integer calculations. But for COS, SIN, LOG etc. I can't think of a use for very high precision (except perhaps to break a world record for most accurate SQRT(2) or something similar)
Post 09 Aug 2005, 02:40
View user's profile Send private message Visit poster's website Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 09 Aug 2005, 14:25
revolution wrote:
Quote:
But I was surprised that "MOVAPD" here seems to be faster than "MOVAPS"
This is from the Intel manual:
Quote:
... However, because of the type mismatch between the operand data type and the instruction data type, a latency penalty will be incurred due to implementations of the instructions at the microarchitecture level.

Latency penalties can also be incurred by using move instructions of the wrong type. For example, MOVAPS and MOVAPD can both be used to move a packed single-precision operand from memory to an XMM register. However, if MOVAPD is used, a latency penalty will be incurred when a correctly typed instruction attempts to use the data in the register.

Note that these latency penalties are not incurred when moving data from XMM registers to memory.
It seems that internally the processor keeps a flag to indicate what type of data to expect (integer, PS or PD) for the next operation.


Thx. I haven't read that chapter in detail yet.

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 09 Aug 2005, 14:25
View user's profile Send private message Reply with quote
Garthower



Joined: 21 Apr 2006
Posts: 158
Location: Ukraine
Garthower 23 Feb 2007, 12:22
2MCD: Do you can post along procedure to calculate sin? I not so understand the formula of calculation that most to choose the need code.
Post 23 Feb 2007, 12:22
View user's profile Send private message Visit poster's website MSN Messenger ICQ Number Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 24 Feb 2007, 05:07
Garthower wrote:
2MCD: Do you can post along procedure to calculate sin? I not so understand the formula of calculation that most to choose the need code.

Sorry, but I have a little trouble in understanding what you actually want from me.

-do you want a procedure that caculates the sine (instead of codsine) => my procedure actually calculates both sine AND cosine
-do you want some math background info how those algorithms work? If so, I will try to post something if you answer that that is what you wanted.
Post 24 Feb 2007, 05:07
View user's profile Send private message Reply with quote
Garthower



Joined: 21 Apr 2006
Posts: 158
Location: Ukraine
Garthower 24 Feb 2007, 11:58
Quote:
-do you want a procedure that caculates the sine (instead of codsine) => my procedure actually calculates both sine AND cosine


If I am not mistaken, you have put here two procedures - calculation cosine and analogue of a FPU-command fsincos. But separate procedure of calculation only a sine is necessary to me also.

Quote:
-do you want some math background info how those algorithms work? If so, I will try to post something if you answer that that is what you wanted.


Yes, it for me is important, to understand how it works.
Post 24 Feb 2007, 11:58
View user's profile Send private message Visit poster's website MSN Messenger ICQ Number Reply with quote
Helle



Joined: 24 Feb 2007
Posts: 23
Location: Germany
Helle 24 Feb 2007, 17:41
@Garthower: Sine:
Code:
format PE CONSOLE 4.0
entry start

Include 'win32a.inc'  

section ".code" code readable writeable executable  

        string  db "Sine value from Rad approximative:",0
        szEnv   db "%s %1.15f",13,10,0
align 16
        RF1     dq  8.333333333333333333e-3      ; 1/5!
                dq -1.666666666666666667e-1      ;-1/3!
                dq  2.755731922398589065e-6      ; 1/9!
                dq -1.984126984126984127e-4      ;-1/7!
                dq  1.605904383682161460e-10     ; 1/13!
                dq -2.505210838544171878e-8      ;-1/11!

        Rad     dq  0.123456789
        Result  dq  ?
start:
; Sin(x) = x - (x^3)/3! + (x^5)/5! - (x^7)/7! + (x^9)/9! - (x^11)/11! + (x^13)/13! - + ...
; Rad = -1.57079632679 to +1.57079632679  (-90° to +90°) or more iterations for more precision!
; SSE2
     movsd xmm0,[Rad]        ;1 = x    1 = Quad-Word1  2 = Quad-Word2
     movhpd xmm0,[Rad]       ;1 and 2 = x
                             ;SSE3: movddup xmm0,[Rad]
     movapd xmm1,xmm0        ;XMM1 = XMM0
     movapd xmm5,xmm0
     mulpd xmm0,xmm1         ;XMM0: 1 = x^2   2 = x^2
     movapd xmm1,xmm0        ;XMM1: 1 = x^2   2 = x^2
     mulsd xmm0,xmm1         ;XMM0: 1 = x^4   2 = x^2
     mulpd xmm1,xmm1         ;XMM1: 1 = x^4   2 = x^4
     mulpd xmm0,xmm5         ;XMM0: 1 = x^5   2 = x^3
     movq xmm3,[Rad]         ;XMM3: 1 = x     2 = 0.0
     movapd xmm2,xmm0        ;XMM2 = XMM0

     mov ecx,-48             ;16 Bytes Read-In, 3 Loops
     lea esi,[RF1]
@@:  movupd xmm4,[esi+48+ecx]
     mulpd xmm0,xmm4
     addpd xmm3,xmm0
     mulpd xmm2,xmm1
     movapd xmm0,xmm2
     add ecx,16
     jnz @r

     movhlps xmm4,xmm3
     addsd xmm4,xmm3
     movsd qword[Result],xmm4

     invoke printf,szEnv, string, dword[Result], dword[Result+4]
     invoke getchar
     invoke ExitProcess,0

section '.idata' import data readable writeable  

  library kernel32,'kernel32.dll',\  
          msvcrt,'msvcrt.dll'  

  import kernel32,\  
         ExitProcess,'ExitProcess'  

  import msvcrt,\  
         getchar,'getchar',\  
         printf,'printf'                                                                               
    

Gruss
Helle
Post 24 Feb 2007, 17:41
View user's profile Send private message Reply with quote
Garthower



Joined: 21 Apr 2006
Posts: 158
Location: Ukraine
Garthower 03 Mar 2007, 13:32
Hi Helle!

In your source in comments you wrote:
Quote:
Rad = -1.57079632679 to +1.57079632679 (-90° to +90°) or more iterations for more precision!


On how many it's need to make iterations for corners, for example, in a range -180° to +180°? By what principle to select quantity of iterations?
Post 03 Mar 2007, 13:32
View user's profile Send private message Visit poster's website MSN Messenger ICQ Number Reply with quote
Helle



Joined: 24 Feb 2007
Posts: 23
Location: Germany
Helle 03 Mar 2007, 18:19
Hello Garthower,
the quantity of iterations is the precision which you want. I probe with a max-range for a given program with the precision which I need. Is the SSE-speed better as FPU - o.K., if not, I take FPU.
For instance, for monitor-displaying (display-pixel in a game, ballistic in a shooter) are 7-8 significant digits enough. This is faster with SSE-iterations as the FPU!
BTW: The (appr.)FPU-precision is 7 significant digits for Single (32-bit) and 15 for Double (64-bit).
Example for a test with Sine(180°), max.12 iterations:
Code:
format PE CONSOLE 4.0 
entry start 

Include 'win32a.inc'   

section '.code' code readable writeable executable
         string db 'Sine value from Pi approximative (Rad)',0
         szEnv  db '%s %2.1d Iterations: %5.15f ',13,10,0
align 16 
        RF1     dq  8.333333333333333333e-3      ; 1/5!    2.Iteration
                dq -1.666666666666666667e-1      ;-1/3!    1.Iteration
                dq  2.755731922398589065e-6      ; 1/9!    4.Iteration
                dq -1.984126984126984127e-4      ;-1/7!    3.Iteration
                dq  1.605904383682161460e-10     ; 1/13!   6.Iteration
                dq -2.505210838544171878e-8      ;-1/11!   5.Iteration
                dq  2.811457254345520763e-15     ; 1/17!   8.Iteration
                dq -7.647163731819816476e-13     ;-1/15!   7.Iteration
                dq  1.957294106339126123e-20     ; 1/21!  10.Iteration
                dq -8.220635246624329717e-18     ;-1/19!   9.Iteration
                dq  6.446950284384473396e-26     ; 1/25!  12.Iteration
                dq -3.868170170630684038e-23     ;-1/23!  11.Iteration

        Rad     dq  3.1415926535897932384626433832795      ;Pi, Sin(Pi)=0.0 (Rad)
        Result  dq  ?
        Iter    dd  2   ;minimum

start:
; Sin(x) = x - (x^3)/3! + (x^5)/5! - (x^7)/7! + (x^9)/9! - (x^11)/11! + (x^13)/13! - + ... 
; SSE2
; Test-Example for x=Pi
L1:
     movsd xmm0,[Rad]        ;1 = x    1 = Quad-Word1  2 = Quad-Word2
     movhpd xmm0,[Rad]       ;1 and 2 = x 
     movapd xmm5,xmm0        ;save XMM0
     mulpd xmm0,xmm0         ;XMM0: 1 = x^2   2 = x^2
     movapd xmm1,xmm0        ;XMM1: 1 = x^2   2 = x^2 
     mulsd xmm0,xmm1         ;XMM0: 1 = x^4   2 = x^2 
     mulpd xmm1,xmm1         ;XMM1: 1 = x^4   2 = x^4 
     mulpd xmm0,xmm5         ;XMM0: 1 = x^5   2 = x^3 
     movq xmm3,[Rad]         ;XMM3: 1 = x     2 = 0.0 
     movapd xmm2,xmm0        ;XMM2 = XMM0 

     mov ecx,[Iter]
     shl ecx,3               ;*16 Bytes / 2 Iterations per Loop =8
     lea esi,[RF1]
     add esi,ecx
     neg ecx

@@:  movupd xmm4,[esi+ecx]
     mulpd xmm0,xmm4 
     addpd xmm3,xmm0 
     mulpd xmm2,xmm1 
     movapd xmm0,xmm2 
     add ecx,16 
     jnz @r 

     movhlps xmm4,xmm3 
     addsd xmm4,xmm3 
     movsd qword[Result],xmm4 

     invoke printf,szEnv, string, dword[Iter], dword[Result], dword[Result+4]

     add [Iter],2
     cmp [Iter],12           ;MaxIter for this Example
     jbe L1

     invoke getchar
     invoke ExitProcess,0 

section '.idata' import data readable writeable   

  library kernel32,'kernel32.dll',\   
          msvcrt,'msvcrt.dll'   

  import kernel32,\   
         ExitProcess,'ExitProcess'   

  import msvcrt,\   
         getchar,'getchar',\   
         printf,'printf'                                                                                
                                                                                                                       

Gruss
Helle
Post 03 Mar 2007, 18:19
View user's profile Send private message Reply with quote
Garthower



Joined: 21 Apr 2006
Posts: 158
Location: Ukraine
Garthower 04 Mar 2007, 14:27
But MCD's code for cosine calculating is universal for all range numbers without any floating iterations...
Post 04 Mar 2007, 14:27
View user's profile Send private message Visit poster's website MSN Messenger ICQ Number Reply with quote
Goplat



Joined: 15 Sep 2006
Posts: 181
Goplat 04 Mar 2007, 20:09
I've found that trying to calculate the sine/cosine of a single number using parallel SSE is actually slower than doing the same approximation using purely scalar SSE/FPU code. If it's possible to parallelize at a higher level (calculate two cosines at once) that would be best.
Post 04 Mar 2007, 20:09
View user's profile Send private message Reply with quote
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan 04 Mar 2007, 21:10
Approximations of sin, cos, etc. are based on Maclaurin's (Taylor's) series, which is basically infinite ;P The more members of serie you take, the more accurate will be result.
Post 04 Mar 2007, 21:10
View user's profile Send private message Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 17 Jan 2008, 16:22
I'm just dropping by to complete an older statement which I blamefully left unexplained for quiet some time now Embarassed

MCD wrote:

Firstly, for certain input angles, it's getting precisions problems (may be perhaps solved).

Actually, my algorithm is numerically unstable for values close to pi/2 + k*pi. This is due to a loss of significance by the subtraction which is required to split the angle into a bigger, divided part (for the look-up table) and remainder part(for the polynomial-series approximation):
(see http://en.wikipedia.org/wiki/Loss_of_significance if you want to know something about loss of significance.)
MCD wrote:

Even If the idea of combining a look up table and a taylor iteration with a faster converting, smaller remainder for trigonometrical calculations based upon the addion theorem is very good, it's still doesn't do it's job better.

I actually wrote _polynomial_-series and not _Taylor_, because the series are not really the Taylo-series, but a better approximation.
I used a kind of minimum-maxDelta approximation with the one parameter fixed at 0 and only the other one left to be optimized (I remeber that I have found out these values by hand with a nested intervall algorithm).

If I should ever recalculate these values, I think it would be more wise to chose some Tschebyscheff-interpolation with the 2 point being variable(and thus even more precise) instead of some hand-crafted approximation.

I don't think I'm going to rewrite this code soon due to lack of time.

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 17 Jan 2008, 16:22
View user's profile Send private message Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4353
Location: Now
edfed 18 Jan 2008, 00:44
how is the algorythm to obtain cosine with only + - * and / ?
anybody know?
same question for square root, logarythm and neperian logarythm...
ouch!!
Post 18 Jan 2008, 00:44
View user's profile Send private message Visit poster's website Reply with quote
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan 18 Jan 2008, 07:16
As stated above Taylor series on wikipedia.
You can easily write an iterative procedure to count sin,cos etc. with needed precision with only * and + and compare to needed loss error on each iteration -> |F(n+1) - F(n)| < Error this could be the condition of exit from iteration loop.

_________________
Any offers?
Post 18 Jan 2008, 07:16
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 18 Jan 2008, 09:15
It seems that you often miss the crucial point in non-ARM architecture optimizations. When on a x86, then instructions with register,memory operands are faster than register,register because of latencies. There have been many occasions and proofs on these boards where this holds true.
http://board.flatassembler.net/topic.php?t=5122&postdays=0&postorder=asc&start=120
Xorpd! wrote:

As usual, an optimizer is never happy to leave things the way they are, and even though going back to a register temporary is twice as fast on phenom, it's a little slower on Core 2 Duo. The solution may be to store to a different memory location for each instruction stream (TESTME = 3). This is just as fast on Core 2 Duo as reusing the same memory location.
Code:
temporary TESTME executable M iter./s
   [r11]     1   testa_X3.exe 1645.144
   xmm2      2   testb_X3.exe 1611.091
   [r11+x]   3   testc_X3.exe 1645.144
    



I see many tries to cut down the instruction count, while not seeing the most obvious. When you load the two parts of an XMM register separately, its bad, but when you load something into it and then shuffle, its even worst (as discovered). The problem is in the latency which comes from the dependency. The most straight-forward and also valid way is to shuffle it straight from the memory and on-the-fly.

The sample I'm going to analyze is this:
Code:
     movsd xmm0,[Rad]        ;1 = x    1 = Quad-Word1  2 = Quad-Word2
     movhpd xmm0,[Rad]       ;1 and 2 = x
     movapd xmm5,xmm0        ;save XMM0
     mulpd xmm0,xmm0         ;XMM0: 1 = x^2   2 = x^2
    

Note: All sources to the amount of latency come from Agner Fog's manual instruction_tables.pdf and the timings are for Core 2

There's a scalar load, then a LATENCY because xmm0 is not ready writing. There's movhpd then which completes the xmm0...and then a LATENCY again. xmm5 is waiting for operations to complete on the xmm0 before it can read the results. Multiply itself has a long latency. This sequence of instructions therefore can never be completed under 10 clocks because there's no early-out that gives other instructions the ability to start in the same clock.

So coming from 24 down to 22 bytes (32-bit mode):
Code:
     pshufd xmm0,[Rad],01000100b ; xmm0 loaded with 2 QWORDs
     pshufd xmm5,[Rad],01000100b ; xmm5 loaded with 2 QWORDs
     mulpd  xmm0,xmm0            ; XMM0: 1 = x^2   2 = x^2
    

Its 2 bytes shorter though it seems to be a waste using memory operations twice. The trick here is that pshufd with registers has a 3 clock additional latency, while with memory operand, it doesn't. They can't execute in parallel because they both use port 2 (which Core 2 has only one), but the multiply can start as early as on the third clock. Of course if you need to fit it in less space then use registers for memory reference:
Code:
; 14 bytes, don't use esp because it means one extra byte
     pshufd xmm0,[eax],01000100b ; xmm0 loaded with 2 QWORDs
     pshufd xmm5,[eax],01000100b ; xmm5 loaded with 2 QWORDs
     mulpd  xmm0,xmm0            ; XMM0: 1 = x^2   2 = x^2
    

Now the most crucial point remains. The mulpd with register operands is claimed to have 5 (FIVE!!!) clock latency when with a memory operand it only has ... none Smile

What we should do is change it to mulpd xmm0,[mem] and this can be done by having our value ready in memory. Some tweaks here and there and a few optimizations later what we have now is this:
Code:
;17 bytes
     mov    eax,Rad    ;eax is the reference to Rad with 2 QWORDs
     movapd xmm0,[eax] ;xmm0 loaded with 2 QWORDs
     movapd xmm5,xmm0 ;xmm5 ...
     mulpd  xmm0,[eax] ;xmm0*xmm0

Rad: dq 3.1415926535897932384626433832795
     dq 3.1415926535897932384626433832795
    

Here I sacrificed 8 bytes of memory for another Pi but from 24 to 17 bytes its 7 bytes less code. We lost only 1 byte Smile
The speed? mov eax,mem can be and should be done only once before the loop. Because it has a one-clock latency, the first loop can begin from the second clock. The other iterations have eax initialized already and if we assume clock starting from the L1: (the first SSE instruction) then we get first load in the first clock. Here is why I reverted back to reg,reg version of movapd. When all these three instructions were reg,mem then it would take 3 clocks because of the heavy load on port 2. Now that movapd reg,reg takes that off, the multiply can begin in parallel with the load in the second clock.
EDIT: Lets trust the reordering unit. Movapd should choose 1 or 5 to get multiply to port 0 as requested.

EDIT: The result is that we came from the roughly 10+ clocks down to 2 or 3 depending on out-of-order unit!


Last edited by Madis731 on 18 Jan 2008, 21:28; edited 2 times in total
Post 18 Jan 2008, 09:15
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20454
Location: In your JS exploiting you and your system
revolution 18 Jan 2008, 11:53
Don't forget that the C2 can decode up to 4 instructions in one clock only if they are in the same 16byte aligned memory. Your last piece of code:
Code:
;17 bytes
     mov    eax,Rad    ;eax is the reference to Rad with 2 QWORDs
     movapd xmm0,[eax] ;xmm0 loaded with 2 QWORDs
     mulpd  xmm0,[eax] ;xmm0*xmm0
     movapd xmm5,xmm0 ;xmm5 ...

Rad: dq 3.1415926535897932384626433832795
     dq 3.1415926535897932384626433832795     
breaks that rule. With 17 bytes it needs two clocks minimum just to decode it. Also the value of xmm5 after this is not the same as the previous sections of code!
Post 18 Jan 2008, 11:53
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 18 Jan 2008, 12:03
Oh, damn, I was so focused on optimizing that I forgot about functionality :S I'll try to fix the previous post Wink

Btw, I tried to give a go on the MCD's best-performing code, but I got stuck. I'll think about it later, maybe I can get rid of all the dependencies:
Code:
        movsd   xmm0,[esi]      ; faster than movlpd
        mulsd   xmm0,[Rad2LUT]  ; dependency
        cvtsd2si eax,xmm0       ; dep.
        cvtsi2sd xmm1,eax       ; dep.
        subsd   xmm0,xmm1       ; dep.
        mulsd   xmm0,[LUT2Rad]  ; dep.
        movlhps xmm0,xmm0       ; dep.
        mulsd   xmm0,xmm0       ; dep.
        movapd  xmm1,xmm0       ; dep. Smile didn't know how to fix this
        movapd  xmm2,xmm0
        movlhps xmm0,xmm0       ; don't use the written reg immediately
        addsd   xmm2,[_Neg2.0]  ; interleave
        mulpd   xmm1,xmm0
        mulsd   xmm2,[_Neg0.5]  ; intl.
        and     eax,0FFh
        mulpd   xmm0,xmm1
        shl     eax,4
        mulpd   xmm1,[CosSinCoeff]
        mulpd   xmm0,[CosSinCoeff+16]
        addpd   xmm2,xmm1
        addpd   xmm2,xmm0
        mulpd   xmm2,[CosSinLUT+eax] ; dep.
        movhlps xmm1,xmm2       ; dep.
        subsd   xmm2,xmm1       ; dep.
        movlpd  [edi],xmm2      ;destination value
    

One thing I was thinking about was the [esi] at the beginning. Maybe we can shuffle it into 2 QWORDs and change first 6 instructions accordingly. Then we can CVT using memory (remember, less latency) and remove the 7th instruction (MOVLHPS).
Post 18 Jan 2008, 12:03
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.