flat assembler
Message board for the users of flat assembler.

 Index > Main > fast cubic interpolation?
Author
SomeoneNew

Joined: 12 Aug 2006
Posts: 54
SomeoneNew 10 Dec 2007, 17:56
Apart from the obvious, what else can I do to optimize a cubic interpolation routine?

I have a normal implementation of the cubic interpolation method, but... I believe it could be optimized.

Notice my code is C, but I'm willing to go ASM on it for the sake of speed.

My implementation is the same as you could find in many textbooks, papers, etc. It follows, just in case someone is willing to help me out on this one:

Code:
```double CubicInterpolate(
double y0,double y1,
double y2,double y3,
double mu)
{
double a0,a1,a2,a3,mu2;

mu2 = mu*mu;
a0 = y3 - y2 - y0 + y1;
a1 = y0 - y1 - a0;
a2 = y2 - y0;
a3 = y1;

return(a0*mu*mu2+a1*mu2+a2*mu+a3);
}    ```

(This is a paste from one of those papers; my actual implementation is float, not double - at least here floats are faster...)

Any ideas to speed this up?
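One compiler-independent tweak worth noting here (a sketch, not from the pasted papers - the function name is made up): factoring the polynomial into Horner form drops mu2 and a3 entirely and cuts the multiply count from five to three.

```c
/* Horner-form sketch of the same cubic (hypothetical name).
   ((a0*mu + a1)*mu + a2)*mu + y1 needs 3 multiplies instead of 5,
   and neither mu2 nor a3 is ever materialised. */
double CubicInterpolateHorner(double y0, double y1,
                              double y2, double y3, double mu)
{
    double a0 = y3 - y2 - y0 + y1;
    double a1 = y0 - y1 - a0;
    double a2 = y2 - y0;
    return ((a0 * mu + a1) * mu + a2) * mu + y1;
}
```

As with the original, mu=0 yields y1 and mu=1 yields y2, so the endpoints are preserved.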

_________________
I'm new, sorry if I bothered you with a stupid question
10 Dec 2007, 17:56
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 10 Dec 2007, 18:34
1 - You don't need a3.
2 - Try working with SSE and do many computations in parallel (or SSE2 if you still want doubles).
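A sketch of that SSE idea with intrinsics (the function name, the unaligned loads, and the Horner factoring are my assumptions, not code from this thread) - four interpolations per call, one per SSE lane:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical 4-wide version: each input array holds four independent
   values, and all four results are computed in parallel. */
void CubicInterpolate4(const float *y0, const float *y1,
                       const float *y2, const float *y3,
                       const float *mu, float *out)
{
    __m128 v0 = _mm_loadu_ps(y0);
    __m128 v1 = _mm_loadu_ps(y1);
    __m128 v2 = _mm_loadu_ps(y2);
    __m128 v3 = _mm_loadu_ps(y3);
    __m128 m  = _mm_loadu_ps(mu);

    /* a0 = y3 - y2 - y0 + y1 */
    __m128 a0 = _mm_add_ps(_mm_sub_ps(_mm_sub_ps(v3, v2), v0), v1);
    /* a1 = y0 - y1 - a0 */
    __m128 a1 = _mm_sub_ps(_mm_sub_ps(v0, v1), a0);
    /* a2 = y2 - y0 */
    __m128 a2 = _mm_sub_ps(v2, v0);

    /* Horner form: ((a0*mu + a1)*mu + a2)*mu + y1, so a3 is never needed */
    __m128 r = _mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(
                   _mm_add_ps(_mm_mul_ps(a0, m), a1), m), a2), m), v1);
    _mm_storeu_ps(out, r);
}
```

With 16-byte-aligned buffers the `_mm_loadu_ps`/`_mm_storeu_ps` pairs could become aligned loads and stores.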
10 Dec 2007, 18:34
DJ Mauretto

Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 11 Dec 2007, 11:07
Quote:

Any ideas to speed this up?

Hello,
This code isn't complicated; any C compiler with the right CPU target
optimisation will do better than a medium asm coder.
Don't focus your attention on speed, it's useless,
particularly if it is simply addition, subtraction or multiplication.
An extra 20-30 CPU cycles aren't significant for a few code lines;
software is made up of thousands of code lines.
11 Dec 2007, 11:07
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 11 Dec 2007, 11:19
DJ Mauretto wrote:
This code isn't complicated; any C compiler with the right CPU target optimisation will do better than a medium asm coder.
How do you define "medium asm coder"? Where does that term come from? I seriously doubt that "any C compiler" can do a good job. Some compilers are plain terrible at optimisations while others do a good job. To claim that "any" will do good optimisations is just wrong. Very few compilers are capable of automatic vectorisation, and even the few that can do it need a lot of help to get things working well. I strongly suspect you have never tried to get code like the above optimised and have always just assumed the compiler will do a good job!
DJ Mauretto wrote:
Don't focus your attention on speed, it's useless, particularly if it is simply addition, subtraction or multiplication. An extra 20-30 CPU cycles aren't significant for a few code lines; software is made up of thousands of code lines.
You are neglecting to consider how many times a piece of code is used; it depends on the job it must do. Most performance-sensitive applications have only small sections that are looped over many billions of times. Things like matrix multiplies can be huge performance bottlenecks without the right optimised code. I think your advice is ill-considered in saying that thousands of code lines are all going to be of equal importance to performance.
11 Dec 2007, 11:19
DJ Mauretto

Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 11 Dec 2007, 11:36
I only expressed my opinion.
I have seen too many people get lost in stupid details
without drawing any significant improvement.
I think the most important thing is the algorithm,
and of course creativity; the rest is just
unnecessary nonsense.
11 Dec 2007, 11:36

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Let's put it to the test: what does your compiler produce?
Mine (20070809 Package ID: W_CC_C_10.0.027) tells me this:
Code:
```?CubicInterpolate@@YANNNNNN@Z   PROC NEAR
; parameter 1: xmm0
; parameter 2: xmm1
; parameter 3: xmm2
; parameter 4: xmm3
; parameter 5: 96 + rsp
$B1$1:                          ; Preds $B1$0
sub       rsp, 56                                       ;5.1
movaps    XMMWORD PTR [rsp+32], xmm15                   ;5.1
movaps    xmm15, xmm0                                   ;10.14
movsd     xmm4, QWORD PTR [rsp+96]                      ;5.1
movaps    xmm5, xmm4                                    ;8.13
mulsd     xmm5, xmm4                                    ;8.13
subsd     xmm3, xmm0                                    ;9.19
subsd     xmm3, xmm2                                    ;9.14
subsd     xmm15, xmm1                                   ;10.14
subsd     xmm15, xmm3                                   ;10.19
mulsd     xmm3, xmm5                                    ;14.17
mulsd     xmm15, xmm5                                   ;14.24
subsd     xmm2, xmm0                                    ;11.14
mulsd     xmm3, xmm4                                    ;14.34
movaps    xmm0, xmm15                                   ;14.34
movaps    xmm15, XMMWORD PTR [rsp+32]                   ;14.34
ret                                                     ;14.34
ALIGN     2
```

Btw, you get the same result just with "icl test.cpp" or "icl /S /Ox /QxO /Qvec-report3 /Qip /Qipo-S test.cpp", so Intel (the compiler, that is) thinks it isn't optimizable beyond that...

As "every-average-assembly-programmer" can see, the compilers aren't the brightest stars in the sky. They produce average-best code, but taking any example apart by hand you can immediately see the superfluous assignments in:
Code:
```        movsd     xmm4, QWORD PTR [rsp+96]                      ;5.1
movaps    xmm5, xmm4
mulsd     xmm5, xmm4                                    ;8.13
```

EDIT:
If you arrange your code as follows:
Code:
```double CubicInterpolate(
double y0,double y1,
double y2,double y3,
double mu)
{
double a0,a1,a2;

a0 = y3 - y2 - y0 + y1;
a1 = y0 - y1 - a0;
a2 = y2 - y0;

return(a0*mu*mu*mu+a1*mu*mu+a2*mu+y1);
}
```

you'll get:
Code:
```?CubicInterpolate@@YANNNNNN@Z   PROC NEAR
; parameter 1(y0): xmm0
; parameter 2(y1): xmm1
; parameter 3(y2): xmm2
; parameter 4(y3): xmm3
; parameter 5(mu): 40 + rsp
$B1$1:                          ; Preds $B1$0
movaps    xmm4, xmm0                                    ;9.14
movsd     xmm5, QWORD PTR [rsp+40]                      ;5.1
subsd     xmm3, xmm0                                    ;8.19
subsd     xmm3, xmm2                                    ;8.14
subsd     xmm4, xmm1                                    ;9.14
subsd     xmm4, xmm3                                    ;9.19
mulsd     xmm3, xmm5                                    ;12.20
subsd     xmm2, xmm0                                    ;10.14
mulsd     xmm3, xmm5                                    ;12.38
mulsd     xmm3, xmm5                                    ;12.38
movaps    xmm0, xmm3                                    ;12.38
ret                                                     ;12.38
ALIGN     2
```

So the moral is:
LET THE COMPILER DO ITS JOB

Last edited by Madis731 on 12 Dec 2007, 09:31; edited 2 times in total
12 Dec 2007, 08:18

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
EDIT: Obsolete - look at the reviewed and fixed version below.
[---] [---]

I don't know much C, but I threw something together and ICL was glad (I don't know if it's correct...). I made it use float (single), which made it even more optimal:

[---] [---]

Last edited by Madis731 on 13 Dec 2007, 07:15; edited 1 time in total
12 Dec 2007, 09:06
DJ Mauretto

Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 12 Dec 2007, 10:55
Quote:

Lets put it to a test, what does your compiler produce?

My compiler doesn't produce anything, simply because I don't use or have
any compiler.
I ran some tests a while ago on the M$ C compiler from the free DDK package and I was
very impressed - very, very good optimisation.
My opinion remains the same: "C compilers are better than the average
asm programmer".
Of course you must know the C language and the compiler's behaviour very well,
but it's not fundamental for a common program.
12 Dec 2007, 10:55
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 12 Dec 2007, 21:39
Madis731 wrote:
then it gave me optimized assembly for 4 calculations in parallel - I think
All those instructions are scalar, meaning only a single computation is done during each loop. Your compiler cannot vectorise that code into a 4-per-loop routine; you have not given it anywhere near enough clues to do that.
12 Dec 2007, 21:39

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Oh - sorry, what should I do? ^o)
13 Dec 2007, 07:06

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Okay - sorry, I looked into it a bit:
Code:
```float CubicInterpolate(
float y0[],float y1[],
float y2[],float y3[],
float mu[])
{
float a0,a1,a2,ret[4];

for(int i=0;i<4;i++)
{
a0 = y3[i] - y2[i] - y0[i] + y1[i]; //y1-y0 => pxor a1,a1; sub a1,a0; sub a1,y1-y0
a1 = y0[i] - y1[i] - a0;
a2 = y2[i] - y0[i];
ret[i]=((a0*mu[i]+a1)*mu[i]+a2)*mu[i]+y1[i];
}

return *ret;
}
```

This code results in:
Code:
```test4.cpp
test4.cpp(8): (col. 4) remark: LOOP WAS VECTORIZED.
```

on the command line.

The assembly output looks like:
Code:
```; parameter 1(y0): rcx
; parameter 2(y1): rdx
; parameter 3(y2): r8
; parameter 4(y3): r9
; parameter 5(mu): 96 + rsp
$B1$1:                          ; Preds $B1$0
sub       rsp, 56                                       ;5.1
movaps    XMMWORD PTR [rsp+32], xmm15                   ;5.1
movups    xmm4, XMMWORD PTR [r8]                        ;10.15
movups    xmm1, XMMWORD PTR [r9]                        ;10.15
movups    xmm3, XMMWORD PTR [rcx]                       ;10.23
movups    xmm0, XMMWORD PTR [rdx]                       ;10.31
movaps    xmm2, xmm3                                    ;11.15
subps     xmm1, xmm4                                    ;10.15
mov       rax, QWORD PTR [rsp+96]                       ;5.1
movups    xmm5, XMMWORD PTR [rax]                       ;13.14
subps     xmm1, xmm3                                    ;10.23
movaps    xmm15, xmm1                                   ;13.14
mulps     xmm15, xmm5                                   ;13.14
subps     xmm2, xmm0                                    ;11.15
subps     xmm2, xmm1                                    ;11.23
mulps     xmm15, xmm5                                   ;13.24
subps     xmm4, xmm3                                    ;12.15
mulps     xmm15, xmm5                                   ;13.34
movaps    xmm0, xmm15                                   ;16.12
movaps    xmm15, XMMWORD PTR [rsp+32]                   ;16.12
ret                                                     ;16.12
```

And now we start optimizing by letting the compiler learn that there are replicated y1-y0 operations:

Code:
```...
float a0[4],a1[4],a2[4],tmp[4],ret[4];

for(int i=0;i<4;i++)
{
tmp[i]= y1[i] - y0[i];
a0[i] = y3[i] - y2[i] + tmp[i];
tmp[i]= tmp[i] + a0[i];
a2[i] = y2[i] - y0[i];
ret[i]=((a0[i]*mu[i]-tmp[i])*mu[i]+a2[i])*mu[i]+y1[i];
}
...
```

to get something like this:
Code:
```        movups    xmm1, XMMWORD PTR [rcx]                       ;10.18
movups    xmm5, XMMWORD PTR [rdx]                       ;10.18
movups    xmm3, XMMWORD PTR [r8]                        ;11.18
movups    xmm0, XMMWORD PTR [r9]                        ;11.18
movaps    xmm2, xmm5                                    ;10.18
subps     xmm0, xmm3                                    ;11.18
mov       rax, QWORD PTR [rsp+40]                       ;5.1
movups    xmm4, XMMWORD PTR [rax]                       ;15.17
subps     xmm2, xmm1                                    ;10.18
mulps     xmm0, xmm4                                    ;15.17
subps     xmm3, xmm1                                    ;14.18
subps     xmm0, xmm2                                    ;15.23
mulps     xmm0, xmm4                                    ;15.31
mulps     xmm0, xmm4                                    ;15.44
ret                                                     ;18.12
```

1) 5 unaligned MOVs are bad, but can be gotten rid of with proper alignment
2) 1 obligatory MOV to a register
3) 4 packed SUBs
4) 3 packed MULs
==
18 instructions in total (inner loop, without call/ret), of which 11 are calculation instructions (the ones that actually do the work). That should be pretty much minimal, considering the original algorithm had 12 operations. While my shuffling through tmp[] makes it more optimal with 11 operations, it cannot be implemented in fewer instructions because of the tmp[] usage scenario.
13 Dec 2007, 07:14
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 13 Dec 2007, 08:46
Madis731: Your compiler has done a reasonably good job, but the throughput can still be quadrupled (on the right processor). Think about doing more than 4 computations per loop. I would be delighted to see if a compiler can pick up on the technique and manage to fully optimise it.
13 Dec 2007, 08:46

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
You mean - give it 16 source tasks and see if it could fit them all in one loop - unroll? Of course it can, but I don't think it gains from the unroll. The reason is that all the add/subs are 1 uop and the muls are 1-2 clocks (3-4 uops). On average it would be the same with the for loop run 4 times or with a 4-times-longer for loop.

BUT, I will see...

EDIT: OK, checked - the code is the same, but there are variations: e.g.
mov rax, QWORD PTR [rsp+40]
is initialized in the prolog (r11 is used) and rax is used as the counter instead.

What IS interesting is the code you see now (when loop numbers get REALLY big) where "call __chkstk" is added:
Code:
```?CubicInterpolate@@YAMQEAM0000@Z        PROC NEAR
; parameter 1(y0): rcx
; parameter 2(y1): rdx
; parameter 3(y2): r8
; parameter 4(y3): r9
; parameter 5(mu): 160080 + rsp
$B1$1:                          ; Preds $B1$0
mov       eax, 160040                                   ;7.1
call      __chkstk                                      ;7.1
sub       rsp, 160040                                   ;7.1
xor       r10d, r10d                                    ;
mov       r11, QWORD PTR [rsp+160080]                   ;7.1
xor       eax, eax                                      ;
; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$2:                          ; Preds $B1$2 $B1$1
movups    xmm0, XMMWORD PTR [rax+rcx]                   ;12.18
movups    xmm4, XMMWORD PTR [rax+rdx]                   ;12.18
movups    xmm2, XMMWORD PTR [rax+r8]                    ;13.18
movups    xmm5, XMMWORD PTR [rax+r9]                    ;13.18
movups    xmm3, XMMWORD PTR [rax+r11]                   ;16.17
movaps    xmm1, xmm4                                    ;12.18
subps     xmm5, xmm2                                    ;13.18
subps     xmm2, xmm0                                    ;15.18
subps     xmm1, xmm0                                    ;12.18
mulps     xmm5, xmm3                                    ;16.17
subps     xmm5, xmm1                                    ;16.23
mulps     xmm5, xmm3                                    ;16.31
mulps     xmm5, xmm3                                    ;16.44
movntps   XMMWORD PTR [rsp+r10+32], xmm5                ;16.2
cmp       rax, 160000                                   ;10.4
jl        $B1$2         ; Prob 99%                      ;10.4
; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$3:                          ; Preds $B1$2
movss     xmm0, DWORD PTR [rsp+32]                      ;19.12
ret                                                     ;19.12
ALIGN     2
```

Btw, does anyone see opportunities to make Intel C++ use F32vec4 (http://download.intel.com/support/performancetools/c/linux/v9/cref_cls.pdf)
to make it "understand" what I'm trying to achieve...
13 Dec 2007, 12:12
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 13 Dec 2007, 13:25
Madis731: It is clear that your compiler did not see the opportunity to optimise this further. It leaves the dependency chain on the result register and wastes the other available registers by simply ignoring them. I think the only way to further optimise this is with assembly. This is where the compilers fall down. The simple reason being that they are not thinking programs.
13 Dec 2007, 13:25

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
By that you mean:
1) xmm5 has a 5 (or 6) instruction long dependency just before store
2) other registers like rax rdx rcx rbx rbp rsi rdi r8..r15 xmm6..xmm15 unused
3) maybe even use different registers for a second run and interleave those two sub-functions to lessen the dependency penalty.

?

btw, I tested one "force-unroll-by-4" scenario and it gave back a ~4x longer inner loop that actually did 16 calculations in it, but I was really surprised that it made rather good interleaves there. Every next stage started loading values just before the store of the current stage, so the throughput should be better than in the code you see 2 posts back.
Then the stage-2 calculations start (73 lines altogether, which is 19*4-3, where the 3 comes from the loads saved because of only one initial load of [rax+r11]).
13 Dec 2007, 13:48
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 13 Dec 2007, 14:05
Madis731: Yes to all three points above. On a Penryn processor (not yet in the mainstream) the 4-way execution scheduler would be able to attain 4 instructions per clock cycle with the proper code. This could come at no cost to other, less capable processors running the same code.
13 Dec 2007, 14:05
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 14 Dec 2007, 13:53
BTW: the 4 instructions per clock cycle is for a single core. If you can also multi-thread the code across all 4 cores then the throughput is sixteen times that of the basic vectorised compiler code. Other issues can come into play, like memory bottlenecks and thread synchronisation, but those are very advanced topics whose only purpose is to give programmers headaches.

This would have to be a very special case of the code being run trillions of times to make the multi-threading and hand crafted assembly optimisations worth the effort in a commercial setting. But for a hobbyist this type of effort can be very rewarding for learning purposes regardless of the actual performance increase of the overall program.
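A minimal sketch of that multi-threading idea using OpenMP (the pragma placement and the function name are my assumptions, not code from this thread):

```c
#include <stddef.h>

/* Hypothetical driver: each OpenMP thread takes a contiguous chunk of
   iterations, and a vectorising compiler can still do 4 lanes per
   iteration inside each chunk. Without -fopenmp the pragma is simply
   ignored and the loop runs serially, producing identical results. */
void CubicInterpolateAll(const float *y0, const float *y1,
                         const float *y2, const float *y3,
                         const float *mu, float *out, size_t n)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        float a0 = y3[i] - y2[i] - y0[i] + y1[i];
        float a1 = y0[i] - y1[i] - a0;
        float a2 = y2[i] - y0[i];
        /* Horner form, same polynomial as the original routine */
        out[i] = ((a0 * mu[i] + a1) * mu[i] + a2) * mu[i] + y1[i];
    }
}
```

Since each output element depends only on its own inputs, the iterations are independent and the loop parallelises without synchronisation.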
14 Dec 2007, 13:53
Xorpd!

Joined: 21 Dec 2006
Posts: 161
Xorpd! 14 Dec 2007, 21:28
I've been taking a look at this thread and thought I might post a couple of FASM versions of this code. cubic1 is optimized for latency (cycle counts are estimated for Core 2 Duo) while cubic2 is optimized for fewest operations. cubic3 and cubic4 are translations to FASM of code posted earlier in this thread. One thing that is striking about the icl code is that it doesn't seem to reorder operations to minimize latency although intel's optimizer is capable of doing so. Perhaps more experimentation with compiler options is in order. In fact one of the examples above even copied a register in the middle of a long dependency chain (costs 3 cycles on Core 2 Duo) for no reason.
Code:
`    `

Even Igor seems to have lessened his hero worship of Intel's compiler technology and taken to bitching on the Intel forum about how crappy their compiler is.

I haven't done anything with the vectorized versions, for the simple reason that if you wanted to do a lot of interpolations you would most likely run a cubic spline through all your data so as to attain continuity through the second derivative, and so you wouldn't be recalculating coefficients the way this method (which is only continuous through the first derivative) does.
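The "don't recalculate coefficients" point can be sketched like this (my own illustration, not Xorpd!'s posted code): when many mu values fall between the same four samples, compute the coefficients once and make each sample a plain Horner step.

```c
/* Coefficients for one interval, computed once (hypothetical names). */
typedef struct { double a0, a1, a2, a3; } CubicSeg;

CubicSeg cubic_seg(double y0, double y1, double y2, double y3)
{
    CubicSeg s;
    s.a0 = y3 - y2 - y0 + y1;
    s.a1 = y0 - y1 - s.a0;
    s.a2 = y2 - y0;
    s.a3 = y1;
    return s;
}

/* Per-sample cost: 3 multiplies and 3 adds, no coefficient work. */
double cubic_eval(const CubicSeg *s, double mu)
{
    return ((s->a0 * mu + s->a1) * mu + s->a2) * mu + s->a3;
}
```

A spline-based scheme would precompute per-interval coefficients the same way, just derived from a global fit instead of four neighbours.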
14 Dec 2007, 21:28
DJ Mauretto

Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 15 Dec 2007, 12:02
I don't understand why you continue with float or double for a and y;
a and y are integers, only mu (alias step) is float.
Code:
```y0    DD ?
y1      DD ?
y2      DD ?
y3      DD ?
a0      DD ?
a1      DD ?
a2      DD ?
step    DD ?

mov     edx, [y0]               ; EDX = y0
mov     ecx, [y2]               ; ECX = y2
mov     esi, [y1]               ; ESI = y1
mov     edi, [y3]               ; EDI = y3
mov     eax, esi                ; EAX = y1
sub     eax, edx                ; y1 - y0
sub     eax, ecx                ; y1 - y0 - y2
add     eax, edi                ; y1 - y0 - y2 + y3
mov     [a0],eax
fild    [a0]                    ; a0
mov     edi, edx                ; EDI = y0
sub     edi, esi                ; y0 - y1
fmul    [step]                  ; a0 * step
sub     edi, eax                ; y0 - y1 - a0
mov     [a1],edi
sub     ecx, edx                ; y2 - y0
fiadd   [a1]                    ; a0 * step + a1
mov     [a2],ecx
fmul    [step]                  ; a0 * step + a1 * step
fiadd   [a2]                    ; a0 * step + a1 * step + a2
fmul    [step]                  ; a0 * step + a1 * step + a2 * step
fiadd   [y1]                    ; a0 * step + a1 * step + a2 * step + y1
fistp   [Result]                ; Have fun with C !!     ```
15 Dec 2007, 12:02
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 15 Dec 2007, 13:47
DJ Mauretto: How do you come to the conclusion that all the inputs are integer? The OP stated the inputs were double in the original code, and also mentioned that that may change to float.

Also, if, as you state, all inputs are actually integer and the only operations are add, sub and mul, then why do you suddenly need float for mu?

Your code is also much slower because it uses the FPU; remember the OP wanted to optimise the routine, not slow it down. Your code has many bugs and won't give the correct answer - try it if you don't believe me.
15 Dec 2007, 13:47