flat assembler
Message board for the users of flat assembler.

Index > Main > fast cubic interpolation?

SomeoneNew



Joined: 12 Aug 2006
Posts: 54
SomeoneNew 10 Dec 2007, 17:56
Apart from the obvious, what else can I do to optimize a cubic interpolation routine?

I have a normal implementation of the cubic interpolation method, but... I believe it could be optimized.

Note that my code is C, but I'm willing to go to ASM for the sake of speed.

My implementation is the same as you'd find in many textbooks, papers, etc. It follows, just in case someone is willing to help me out on this one:

Code:
double CubicInterpolate(
   double y0,double y1,
   double y2,double y3,
   double mu)
{
   double a0,a1,a2,a3,mu2;

   mu2 = mu*mu;
   a0 = y3 - y2 - y0 + y1;
   a1 = y0 - y1 - a0;
   a2 = y2 - y0;
   a3 = y1;

   return(a0*mu*mu2+a1*mu2+a2*mu+a3);
}    


(This is pasted from one of those papers; my actual implementation uses float, not double, since at least here floats are faster...)

Any ideas to speed this up?

_________________
I'm new, sorry if I bother you with any stupid questions Smile
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 10 Dec 2007, 18:34
1 - You don't need a3.
2 - Try working with SSE and do many computations in parallel (or SSE2 if you still want doubles).
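For point 1, here is a quick C sketch of the same routine with a3 folded in and the polynomial regrouped into Horner form, so only three multiplies by mu are needed. The reassociation is mine, so results can differ from the original in the last bit:

```c
/* Same cubic as the posted routine, but a3 (= y1) used directly and
   the polynomial evaluated with Horner's rule:
   ((a0*mu + a1)*mu + a2)*mu + y1
   -- three multiplications by mu instead of five multiplies total. */
double CubicInterpolateHorner(double y0, double y1,
                              double y2, double y3, double mu)
{
    double a0 = y3 - y2 - y0 + y1;
    double a1 = y0 - y1 - a0;
    double a2 = y2 - y0;
    return ((a0 * mu + a1) * mu + a2) * mu + y1;
}
```

At mu = 0 this returns y1 and at mu = 1 it returns y2, which is a handy sanity check for any rearrangement.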
DJ Mauretto



Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 11 Dec 2007, 11:07
Quote:

Any ideas to speed this up?

Hello,
This code isn't complicated; any C compiler with the right CPU target optimisation will do better than a medium asm coder.
Don't focus your attention on speed; it's useless, particularly when it is simply addition, subtraction or multiplication. An extra 20-30 CPU cycles aren't significant for a few code lines, and software is made up of thousands of code lines Wink
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 11 Dec 2007, 11:19
DJ Mauretto wrote:
This code isn't complicated; any C compiler with the right CPU target optimisation will do better than a medium asm coder.
How do you define "medium asm coder"? Where does that term come from? I seriously doubt that "any C compiler" can do a good job. Some compilers are plain terrible at optimisations while others do a good job. To claim that "any" will do good optimisations is just wrong. Very few compilers are capable of automatic vectorisation, and even the few that can do it need a lot of help to get things working well. I strongly suspect you have never tried to get code like the above optimised and have always just assumed the compiler will do a good job!
DJ Mauretto wrote:
Don't focus your attention on speed; it's useless, particularly when it is simply addition, subtraction or multiplication. An extra 20-30 CPU cycles aren't significant for a few code lines, and software is made up of thousands of code lines Wink
You are neglecting to consider how many times a piece of code is used. That depends on the job it must do. Most performance-sensitive applications have only small sections that are looped over many billions of times. Things like matrix multiplies can be huge performance bottlenecks without the right optimised code. I think it is ill-considered to suggest that thousands of code lines are all of equal importance to performance.
DJ Mauretto



Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 11 Dec 2007, 11:36
I only expressed my opinion.
I have seen too many people get lost in stupid details without drawing any significant improvement from them.
I think the most important thing is the algorithm, and of course creativity; the rest is just unnecessary nonsense Wink
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 12 Dec 2007, 08:18
Let's put it to a test: what does your compiler produce?
Mine (20070809 Package ID: W_CC_C_10.0.027) tells me this:
Code:
?CubicInterpolate@@YANNNNNN@Z   PROC NEAR
; parameter 1: xmm0
; parameter 2: xmm1
; parameter 3: xmm2
; parameter 4: xmm3
; parameter 5: 96 + rsp
$B1$1:                          ; Preds $B1$0
        sub       rsp, 56                                       ;5.1
        movaps    XMMWORD PTR [rsp+32], xmm15                   ;5.1
        movaps    xmm15, xmm0                                   ;10.14
        movsd     xmm4, QWORD PTR [rsp+96]                      ;5.1
        movaps    xmm5, xmm4                                    ;8.13
        mulsd     xmm5, xmm4                                    ;8.13
        subsd     xmm3, xmm0                                    ;9.19
        addsd     xmm3, xmm1                                    ;9.24
        subsd     xmm3, xmm2                                    ;9.14
        subsd     xmm15, xmm1                                   ;10.14
        subsd     xmm15, xmm3                                   ;10.19
        mulsd     xmm3, xmm5                                    ;14.17
        mulsd     xmm15, xmm5                                   ;14.24
        subsd     xmm2, xmm0                                    ;11.14
        addsd     xmm3, xmm2                                    ;14.34
        mulsd     xmm3, xmm4                                    ;14.34
        addsd     xmm15, xmm1                                   ;14.34
        addsd     xmm15, xmm3                                   ;14.34
        movaps    xmm0, xmm15                                   ;14.34
        movaps    xmm15, XMMWORD PTR [rsp+32]                   ;14.34
        add       rsp, 56                                       ;14.34
        ret                                                     ;14.34
        ALIGN     2
    


Btw, you get the same result just with "icl test.cpp" or with "icl /S /Ox /QxO /Qvec-report3 /Qip /Qipo-S test.cpp", so Intel (the compiler, that is) thinks it isn't optimizable beyond that...

As "every-average-assembly-programmer" can see, the compilers aren't the brightest stars in the sky. They produce average-best code, but taking any example in hand you can immediately see the superfluous assignments in:
Code:
        movsd     xmm4, QWORD PTR [rsp+96]                      ;5.1
        movaps    xmm5, xmm4
        mulsd     xmm5, xmm4                                    ;8.13
    


EDIT:
If you arrange your code as follows:
Code:
double CubicInterpolate(
   double y0,double y1,
   double y2,double y3,
   double mu)
{
   double a0,a1,a2;

   a0 = y3 - y2 - y0 + y1;
   a1 = y0 - y1 - a0;
   a2 = y2 - y0;

   return(a0*mu*mu*mu+a1*mu*mu+a2*mu+y1);
}
    

you'll get:
Code:
?CubicInterpolate@@YANNNNNN@Z   PROC NEAR
; parameter 1(y0): xmm0
; parameter 2(y1): xmm1
; parameter 3(y2): xmm2
; parameter 4(y3): xmm3
; parameter 5(mu): 40 + rsp
$B1$1:                          ; Preds $B1$0
        movaps    xmm4, xmm0                                    ;9.14
        movsd     xmm5, QWORD PTR [rsp+40]                      ;5.1
        subsd     xmm3, xmm0                                    ;8.19
        addsd     xmm3, xmm1                                    ;8.24
        subsd     xmm3, xmm2                                    ;8.14
        subsd     xmm4, xmm1                                    ;9.14
        subsd     xmm4, xmm3                                    ;9.19
        mulsd     xmm3, xmm5                                    ;12.20
        subsd     xmm2, xmm0                                    ;10.14
        addsd     xmm3, xmm4                                    ;12.38
        mulsd     xmm3, xmm5                                    ;12.38
        addsd     xmm3, xmm2                                    ;12.38
        mulsd     xmm3, xmm5                                    ;12.38
        addsd     xmm3, xmm1                                    ;12.38
        movaps    xmm0, xmm3                                    ;12.38
        ret                                                     ;12.38
        ALIGN     2
    


So the moral is:
LET THE COMPILER DO ITS JOB Very Happy


Last edited by Madis731 on 12 Dec 2007, 09:31; edited 2 times in total
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 12 Dec 2007, 09:06
EDIT: Obsolete - look at the reviewed and fixed version below.
[---] [---]

Don't know much C, but I threw something together and ICL was glad (don't know if it's correct...). I made it use float (single), which made it even more optimal:

[---] [---]


Last edited by Madis731 on 13 Dec 2007, 07:15; edited 1 time in total
DJ Mauretto



Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 12 Dec 2007, 10:55
Quote:

Let's put it to a test: what does your compiler produce?

My compiler produces nothing, simply because I don't use or have any compiler Very Happy
Some time ago I did some tests with the M$ C compiler from the free DDK package, and I was very impressed by its optimisation.
My opinion remains the same: "C compilers are better than the average asm programmer."
Of course you must know the C language and the compiler's behavior very well, but that's not fundamental for a common program.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 12 Dec 2007, 21:39
Madis731 wrote:
then it gave me optimized assembly for 4 calculations in parallel - I think
All those instructions are scalar, meaning only a single computation is done during each loop. Your compiler cannot vectorise that code into a 4 per loop routine, you have not given anywhere near enough clues to the compiler to do that.
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 13 Dec 2007, 07:06
Oh - sorry Sad what should I do? ^o)
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 13 Dec 2007, 07:14
Okay - sorry, I looked into it a bit:
Code:
float CubicInterpolate(
   float y0[],float y1[],
   float y2[],float y3[],
   float mu[])
{
   float a0,a1,a2,ret[4];

   for(int i=0;i<4;i++)
   {
        a0 = y3[i] - y2[i] - y0[i] + y1[i]; //y1-y0 => pxor a1,a1; sub a1,a0; sub a1,y1-y0
        a1 = y0[i] - y1[i] - a0;
        a2 = y2[i] - y0[i];
        ret[i]=((a0*mu[i]+a1)*mu[i]+a2)*mu[i]+y1[i];
   }

   return *ret;
}
    


Compiling this code reports:
Code:
test4.cpp
test4.cpp(8): (col. 4) remark: LOOP WAS VECTORIZED.
    

on the command line.

The assembly output looks like:
Code:
; parameter 1(y0): rcx
; parameter 2(y1): rdx
; parameter 3(y2): r8
; parameter 4(y3): r9
; parameter 5(mu): 96 + rsp
$B1$1:                          ; Preds $B1$0
        sub       rsp, 56                                       ;5.1
        movaps    XMMWORD PTR [rsp+32], xmm15                   ;5.1
        movups    xmm4, XMMWORD PTR [r8]                        ;10.15
        movups    xmm1, XMMWORD PTR [r9]                        ;10.15
        movups    xmm3, XMMWORD PTR [rcx]                       ;10.23
        movups    xmm0, XMMWORD PTR [rdx]                       ;10.31
        movaps    xmm2, xmm3                                    ;11.15
        subps     xmm1, xmm4                                    ;10.15
        mov       rax, QWORD PTR [rsp+96]                       ;5.1
        movups    xmm5, XMMWORD PTR [rax]                       ;13.14
        subps     xmm1, xmm3                                    ;10.23
        addps     xmm1, xmm0                                    ;10.31
        movaps    xmm15, xmm1                                   ;13.14
        mulps     xmm15, xmm5                                   ;13.14
        subps     xmm2, xmm0                                    ;11.15
        subps     xmm2, xmm1                                    ;11.23
        addps     xmm15, xmm2                                   ;13.20
        mulps     xmm15, xmm5                                   ;13.24
        subps     xmm4, xmm3                                    ;12.15
        addps     xmm15, xmm4                                   ;13.30
        mulps     xmm15, xmm5                                   ;13.34
        addps     xmm15, xmm0                                   ;13.40
        movaps    xmm0, xmm15                                   ;16.12
        movaps    xmm15, XMMWORD PTR [rsp+32]                   ;16.12
        add       rsp, 56                                       ;16.12
        ret                                                     ;16.12
    


And now we start optimizing by letting the compiler see that there are repeated y1-y0 operations:

Code:
...
   float a0[4],a1[4],a2[4],tmp[4],ret[4];

   for(int i=0;i<4;i++)
   {
        tmp[i]= y1[i] - y0[i];
        a0[i] = y3[i] - y2[i] + tmp[i];
        tmp[i]= tmp[i] + a0[i];
        a2[i] = y2[i] - y0[i];
        ret[i]=((a0[i]*mu[i]-tmp[i])*mu[i]+a2[i])*mu[i]+y1[i];
   }
...
    


to get something like this:
Code:
        movups    xmm1, XMMWORD PTR [rcx]                       ;10.18
        movups    xmm5, XMMWORD PTR [rdx]                       ;10.18
        movups    xmm3, XMMWORD PTR [r8]                        ;11.18
        movups    xmm0, XMMWORD PTR [r9]                        ;11.18
        movaps    xmm2, xmm5                                    ;10.18
        subps     xmm0, xmm3                                    ;11.18
        mov       rax, QWORD PTR [rsp+40]                       ;5.1
        movups    xmm4, XMMWORD PTR [rax]                       ;15.17
        subps     xmm2, xmm1                                    ;10.18
        addps     xmm0, xmm2                                    ;11.26
        addps     xmm2, xmm0                                    ;12.19
        mulps     xmm0, xmm4                                    ;15.17
        subps     xmm3, xmm1                                    ;14.18
        subps     xmm0, xmm2                                    ;15.23
        mulps     xmm0, xmm4                                    ;15.31
        addps     xmm0, xmm3                                    ;15.37
        mulps     xmm0, xmm4                                    ;15.44
        addps     xmm0, xmm5                                    ;15.50
        ret                                                     ;18.12 
    


1) 5 unaligned MOVs are bad, but can be gotten rid of with proper alignment
2) 1 obligatory MOV to a register
3) 8 packed ADDs/SUBs
4) 3 packed MULs
==
18 instructions in total (inner loop, without call/ret), with 11 calculation instructions (the ones that actually do the work), which should be pretty much minimal considering the original algorithm had 12 operations. While my shuffling through tmp[] makes it more optimal at 11 operations, it cannot be implemented in fewer instructions because of how tmp[] is used.
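For completeness, here is a plain-C restatement of the four-wide loop that writes all four lanes to an output array (the version above returns only *ret, i.e. lane 0). The names are mine; the Horner grouping matches the ret[i] line:

```c
/* Structure-of-arrays, four lanes per call. Each arithmetic line maps
   naturally to one packed SSE instruction once the compiler
   vectorizes the loop. */
void CubicInterpolate4(const float y0[4], const float y1[4],
                       const float y2[4], const float y3[4],
                       const float mu[4], float out[4])
{
    for (int i = 0; i < 4; i++) {
        float a0 = y3[i] - y2[i] - y0[i] + y1[i];
        float a1 = y0[i] - y1[i] - a0;
        float a2 = y2[i] - y0[i];
        /* Horner form: three multiplies per lane */
        out[i] = ((a0 * mu[i] + a1) * mu[i] + a2) * mu[i] + y1[i];
    }
}
```

Returning all four results also avoids the question of what the caller is supposed to do with the other three lanes.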
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 13 Dec 2007, 08:46
Madis731: Your compiler has done a reasonably good job, but throughput can still be quadrupled (on the right processor). Think about doing more than 4 computations per loop. I would be delighted to see if a compiler can pick up on the technique and manage to fully optimise it.
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 13 Dec 2007, 12:12
You mean - give it 16 source tasks and see if it can fit them all in one loop, i.e. unroll? Of course it can, but I think it doesn't gain from the unroll. The reason is that all the add/subs are 1 uop and the muls are 1-2 clocks (3-4 uops). On average it would be the same with the for loop run 4 times or with a 4-times-longer for loop Smile

BUT, I will see...

EDIT: OK, checked - the code is the same, but there are variations: for example,
mov rax, QWORD PTR [rsp+40]
is initialized in the prologue (r11 is used) and rax is used as the counter instead.

What IS interesting is the code you get when the loop counts become REALLY big, where "call __chkstk" is added:
Code:
?CubicInterpolate@@YAMQEAM0000@Z        PROC NEAR
; parameter 1(y0): rcx
; parameter 2(y1): rdx
; parameter 3(y2): r8
; parameter 4(y3): r9
; parameter 5(mu): 160080 + rsp
$B1$1:                          ; Preds $B1$0
        mov       eax, 160040                                   ;7.1
        call      __chkstk                                      ;7.1
        sub       rsp, 160040                                   ;7.1
        xor       r10d, r10d                                    ;
        mov       r11, QWORD PTR [rsp+160080]                   ;7.1
        xor       eax, eax                                      ;
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$2:                          ; Preds $B1$2 $B1$1
        movups    xmm0, XMMWORD PTR [rax+rcx]                   ;12.18
        movups    xmm4, XMMWORD PTR [rax+rdx]                   ;12.18
        movups    xmm2, XMMWORD PTR [rax+r8]                    ;13.18
        movups    xmm5, XMMWORD PTR [rax+r9]                    ;13.18
        movups    xmm3, XMMWORD PTR [rax+r11]                   ;16.17
        movaps    xmm1, xmm4                                    ;12.18
        subps     xmm5, xmm2                                    ;13.18
        subps     xmm2, xmm0                                    ;15.18
        subps     xmm1, xmm0                                    ;12.18
        addps     xmm5, xmm1                                    ;13.26
        addps     xmm1, xmm5                                    ;14.19
        mulps     xmm5, xmm3                                    ;16.17
        add       rax, 16                                       ;10.4
        subps     xmm5, xmm1                                    ;16.23
        mulps     xmm5, xmm3                                    ;16.31
        addps     xmm5, xmm2                                    ;16.37
        mulps     xmm5, xmm3                                    ;16.44
        addps     xmm5, xmm4                                    ;16.50
        movntps   XMMWORD PTR [rsp+r10+32], xmm5                ;16.2
        add       r10, 16                                       ;10.4
        cmp       rax, 160000                                   ;10.4
        jl        $B1$2         ; Prob 99%                      ;10.4
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$3:                          ; Preds $B1$2
        movss     xmm0, DWORD PTR [rsp+32]                      ;19.12
        add       rsp, 160040                                   ;19.12
        ret                                                     ;19.12
        ALIGN     2
    


Btw, does anyone see opportunities to make Intel C++ use F32vec4 (http://download.intel.com/support/performancetools/c/linux/v9/cref_cls.pdf)
to make it "understand" what I'm trying to achieve...
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 13 Dec 2007, 13:25
Madis731: It is clear that your compiler did not see the opportunity to optimise this further. It leaves the dependency chain on the result register and wastes the other available registers by simply ignoring them. I think the only way to optimise this further is with assembly. This is where compilers fall down, for the simple reason that they are not thinking programs.
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 13 Dec 2007, 13:48
By that you mean:
1) xmm5 has a 5 (or 6) instruction long dependency just before store
2) other registers like rax rdx rcx rbx rbp rsi rdi r8..r15 xmm6..xmm15 unused
3) maybe even use different registers for a second run and interleave those two sub-functions to lessen the dependency penalty.

?

btw, I tested one "force-unroll-by-4" scenario and it gave back a ~4x longer inner loop that actually did 16 calculations, and I was really surprised that it made rather good interleaves there. Each next stage starts loading values just before the store of the current stage, so the throughput should be better than the code two posts back.
The policy was something like stage2_load1; stage1_do_sth; stage2_load2; stage1_do_sth; stage2_load3; stage1_RESULT;
and then the stage2 calculations start (73 lines altogether, which is 19*4-3, where the 3 comes from the loads saved because of only one initial load of [rax+r11]).
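A rough C sketch of that interleaving idea: two independent coefficient chains per loop iteration, so the scheduler can overlap one chain's multiplies with the other's adds. The naming and the fixed count of 8 are mine; register allocation is left to the compiler:

```c
/* Two independent dependency chains (A and B) per iteration.
   Neither chain reads the other's temporaries, so their
   instructions can execute in parallel on an out-of-order core. */
void CubicInterpolate8(const float *y0, const float *y1,
                       const float *y2, const float *y3,
                       const float *mu, float *out)
{
    for (int i = 0; i < 8; i += 2) {
        /* chain A: element i */
        float a0 = y3[i] - y2[i] - y0[i] + y1[i];
        float a1 = y0[i] - y1[i] - a0;
        float a2 = y2[i] - y0[i];
        /* chain B: element i+1, independent of chain A */
        float b0 = y3[i+1] - y2[i+1] - y0[i+1] + y1[i+1];
        float b1 = y0[i+1] - y1[i+1] - b0;
        float b2 = y2[i+1] - y0[i+1];
        out[i]   = ((a0 * mu[i]   + a1) * mu[i]   + a2) * mu[i]   + y1[i];
        out[i+1] = ((b0 * mu[i+1] + b1) * mu[i+1] + b2) * mu[i+1] + y1[i+1];
    }
}
```

The same structure extends to four or more chains if the register budget allows; with SSE registers for each chain this is exactly the "use xmm6..xmm15" point above.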
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 13 Dec 2007, 14:05
Madis731: Yes to all three points above. On a Penryn processor (not yet in the mainstream) the 4-way execution scheduler would be able to attain 4 instructions per clock cycle with the proper code. This could come at no cost to other, less capable processors running the same code.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 14 Dec 2007, 13:53
BTW: The 4 instructions per clock cycle is for a single core. If you can also multi-thread the code on all 4 cores then the throughput is sixteen times the basic vectorised compiler code. Other issues can come into effect, like memory bottlenecks and thread synchronising but those are very advanced topics whose only purpose is to give programmers headaches. Wink

This would have to be a very special case of the code being run trillions of times to make the multi-threading and hand crafted assembly optimisations worth the effort in a commercial setting. But for a hobbyist this type of effort can be very rewarding for learning purposes regardless of the actual performance increase of the overall program.
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 14 Dec 2007, 21:28
I've been taking a look at this thread and thought I might post a couple of FASM versions of this code. cubic1 is optimized for latency (cycle counts are estimated for Core 2 Duo) while cubic2 is optimized for fewest operations. cubic3 and cubic4 are translations to FASM of code posted earlier in this thread. One thing that is striking about the icl code is that it doesn't seem to reorder operations to minimize latency although intel's optimizer is capable of doing so. Perhaps more experimentation with compiler options is in order. In fact one of the examples above even copied a register in the middle of a long dependency chain (costs 3 cycles on Core 2 Duo) for no reason.
Code:
    

Even Igor seems to have lessened his hero worship of Intel's compiler technology and taken to bitching on the Intel forum about how crappy their compiler is.

I haven't done anything with the vectorized versions for the simple reason that I think that if you wanted to do a lot of interpolations you would most likely run a cubic spline through all your data so as to attain continuity through the second derivative and so you wouldn't be recalculating coefficients as this method, which only is continuous through the first derivative, does.
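On that coefficient-recalculation point: even without going to a full spline, the setup work can be hoisted when many mu values fall between the same four samples. A hypothetical split (names are mine, not from the thread):

```c
/* Compute the cubic's coefficients once per segment, then evaluate
   cheaply for each mu in that segment. Useful when upsampling, where
   many output points share the same four input samples. */
typedef struct { float a0, a1, a2, a3; } CubicCoeffs;

static CubicCoeffs CubicSetup(float y0, float y1, float y2, float y3)
{
    CubicCoeffs c;
    c.a0 = y3 - y2 - y0 + y1;
    c.a1 = y0 - y1 - c.a0;
    c.a2 = y2 - y0;
    c.a3 = y1;
    return c;
}

static float CubicEval(CubicCoeffs c, float mu)
{
    /* Horner form: three multiplies, no coefficient recomputation */
    return ((c.a0 * mu + c.a1) * mu + c.a2) * mu + c.a3;
}
```

A spline-based interpolator would amortize even more work per segment, at the cost of a global solve, which is the trade-off described above.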
DJ Mauretto



Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 15 Dec 2007, 12:02
I don't understand why you continue with float or double for a and y;
a and y are integers, only mu (alias step) is float Wink
Code:
y0      DD ?
y1      DD ?
y2      DD ?
y3      DD ?
a0      DD ?
a1      DD ?
a2      DD ?
step    DD ?


        mov     edx, [y0]               ; EDX = y0
        mov     ecx, [y2]               ; ECX = y2
        mov     esi, [y1]               ; ESI = y1
        mov     edi, [y3]               ; EDI = y3
        mov     eax, esi                ; EAX = y1
        sub     eax, edx                ; y1 - y0
        sub     eax, ecx                ; y1 - y0 - y2
        add     eax, edi                ; y1 - y0 - y2 + y3
        mov     [a0], eax
        fild    [a0]                    ; a0
        mov     edi, edx                ; EDI = y0
        sub     edi, esi                ; y0 - y1
        fmul    [step]                  ; a0 * step
        sub     edi, eax                ; y0 - y1 - a0
        mov     [a1], edi
        sub     ecx, edx                ; y2 - y0
        fiadd   [a1]                    ; a0 * step + a1
        mov     [a2], ecx
        fmul    [step]                  ; (a0 * step + a1) * step
        fiadd   [a2]                    ; (a0 * step + a1) * step + a2
        fmul    [step]                  ; ((a0 * step + a1) * step + a2) * step
        fiadd   [y1]                    ; ((a0 * step + a1) * step + a2) * step + y1
        fistp   [Result]                ; Have fun with C !! Smile
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 15 Dec 2007, 13:47
DJ Mauretto: How do you come to the conclusion that all the inputs are integers? The OP stated the inputs were double in the original code, and also mentioned that they may change to float.

Also, if, as you state, all inputs are actually integer and the only operations are add, sub and mul then why do you suddenly need float for mu?

Your code is also much slower by using the FPU; remember the OP wanted to optimise it, not slow it down. Your code has many bugs and won't give the correct answer; try it if you don't believe me.


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.