flat assembler
Message board for the users of flat assembler.

 Index > Main > fast cubic interpolation?
Author
SomeoneNew

Joined: 12 Aug 2006
Posts: 54
SomeoneNew 10 Dec 2007, 17:56
Apart from the obvious, what else can I do to optimize a cubic interpolation routine?

I have a normal implementation of the cubic interpolation method, but... I believe it could be optimized.

Notice my code is C, but I'm willing to go ASM on it for the sake of speed.

My implementation is the same as you could find in many textbooks, papers, etc. It follows, just in case someone is willing to help me out on this one:

Code:
```double CubicInterpolate(
double y0,double y1,
double y2,double y3,
double mu)
{
double a0,a1,a2,a3,mu2;

mu2 = mu*mu;
a0 = y3 - y2 - y0 + y1;
a1 = y0 - y1 - a0;
a2 = y2 - y0;
a3 = y1;

return(a0*mu*mu2+a1*mu2+a2*mu+a3);
}    ```

(This is a paste from one of those papers; my actual implementation is float, not double - at least here floats are faster...)

Any ideas to speed this up?
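One compiler-independent tweak worth noting here (a sketch, not from the pasted papers - the function name is made up): factoring the polynomial into Horner form drops mu2 and a3 entirely and cuts the multiply count from five to three.

```c
/* Horner-form sketch of the same cubic (hypothetical name).
   ((a0*mu + a1)*mu + a2)*mu + y1 needs 3 multiplies instead of 5,
   and neither mu2 nor a3 is ever materialised. */
double CubicInterpolateHorner(double y0, double y1,
                              double y2, double y3, double mu)
{
    double a0 = y3 - y2 - y0 + y1;
    double a1 = y0 - y1 - a0;
    double a2 = y2 - y0;
    return ((a0 * mu + a1) * mu + a2) * mu + y1;
}
```

As with the original, mu=0 yields y1 and mu=1 yields y2, so the endpoints are preserved.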

_________________
I'm new, sorry if I bothered you with a stupid question
10 Dec 2007, 17:56
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 10 Dec 2007, 18:34
1 - You don't need a3.
2 - Try working with SSE and do many computations in parallel (or SSE2 if you still want doubles).
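A sketch of that SSE idea with intrinsics (the function name, the unaligned loads, and the Horner factoring are my assumptions, not code from this thread) - four interpolations per call, one per SSE lane:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical 4-wide version: each input array holds four independent
   values, and all four results are computed in parallel. */
void CubicInterpolate4(const float *y0, const float *y1,
                       const float *y2, const float *y3,
                       const float *mu, float *out)
{
    __m128 v0 = _mm_loadu_ps(y0);
    __m128 v1 = _mm_loadu_ps(y1);
    __m128 v2 = _mm_loadu_ps(y2);
    __m128 v3 = _mm_loadu_ps(y3);
    __m128 m  = _mm_loadu_ps(mu);

    /* a0 = y3 - y2 - y0 + y1 */
    __m128 a0 = _mm_add_ps(_mm_sub_ps(_mm_sub_ps(v3, v2), v0), v1);
    /* a1 = y0 - y1 - a0 */
    __m128 a1 = _mm_sub_ps(_mm_sub_ps(v0, v1), a0);
    /* a2 = y2 - y0 */
    __m128 a2 = _mm_sub_ps(v2, v0);

    /* Horner form: ((a0*mu + a1)*mu + a2)*mu + y1, so a3 is never needed */
    __m128 r = _mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(
                   _mm_add_ps(_mm_mul_ps(a0, m), a1), m), a2), m), v1);
    _mm_storeu_ps(out, r);
}
```

With 16-byte-aligned buffers the `_mm_loadu_ps`/`_mm_storeu_ps` pairs could become aligned loads and stores.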
10 Dec 2007, 18:34
DJ Mauretto

Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 11 Dec 2007, 11:07
Quote:

Any ideas to speed this up?

Hello,
This code isn't complicated; any C compiler with the right CPU target
optimisation will do better than a medium asm coder.
Don't focus your attention on speed, it's useless,
particularly if it is simply addition, subtraction or multiplication.
An extra 20-30 CPU cycles aren't significant for a few code lines;
software is made up of thousands of code lines.
11 Dec 2007, 11:07
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 11 Dec 2007, 11:19
DJ Mauretto wrote:
This code isn't complicated; any C compiler with the right CPU target optimisation will do better than a medium asm coder.
How do you define "medium asm coder"? Where does that term come from? I seriously doubt that "any C compiler" can do a good job. Some compilers are plain terrible at optimisations while others do a good job. To claim that "any" will do good optimisations is just wrong. Very few compilers are capable of automatic vectorisation, and even the few that can do it need a lot of help to get things working well. I strongly suspect you have never tried to get code like the above optimised and have always just assumed the compiler will do a good job!
DJ Mauretto wrote:
Don't focus your attention on speed, it's useless, particularly if it is simply addition, subtraction or multiplication. An extra 20-30 CPU cycles aren't significant for a few code lines; software is made up of thousands of code lines.
You are neglecting to consider how many times a piece of code is used; it depends on the job it must do. Most performance-sensitive applications have only small sections that are looped over many billions of times. Things like matrix multiplies can be huge performance bottlenecks without the right optimised code. I think your advice is ill-considered in saying that thousands of code lines are all going to be of equal importance to performance.
11 Dec 2007, 11:19
DJ Mauretto

Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 11 Dec 2007, 11:36
I only expressed my opinion.
I have seen too many people get lost in stupid details
without drawing any significant improvement.
I think the most important thing is the algorithm,
and of course creativity; the rest is just
unnecessary nonsense.
11 Dec 2007, 11:36

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Let's put it to the test: what does your compiler produce?
Mine (20070809 Package ID: W_CC_C_10.0.027) tells me this:
Code:
```?CubicInterpolate@@YANNNNNN@Z   PROC NEAR
; parameter 1: xmm0
; parameter 2: xmm1
; parameter 3: xmm2
; parameter 4: xmm3
; parameter 5: 96 + rsp
$B1$1:                          ; Preds $B1$0
sub       rsp, 56                                       ;5.1
movaps    XMMWORD PTR [rsp+32], xmm15                   ;5.1
movaps    xmm15, xmm0                                   ;10.14
movsd     xmm4, QWORD PTR [rsp+96]                      ;5.1
movaps    xmm5, xmm4                                    ;8.13
mulsd     xmm5, xmm4                                    ;8.13
subsd     xmm3, xmm0                                    ;9.19
subsd     xmm3, xmm2                                    ;9.14
subsd     xmm15, xmm1                                   ;10.14
subsd     xmm15, xmm3                                   ;10.19
mulsd     xmm3, xmm5                                    ;14.17
mulsd     xmm15, xmm5                                   ;14.24
subsd     xmm2, xmm0                                    ;11.14
mulsd     xmm3, xmm4                                    ;14.34
movaps    xmm0, xmm15                                   ;14.34
movaps    xmm15, XMMWORD PTR [rsp+32]                   ;14.34
ret                                                     ;14.34
ALIGN     2
```

Btw, you get the same result just with "icl test.cpp" or "icl /S /Ox /QxO /Qvec-report3 /Qip /Qipo-S test.cpp", so Intel (the compiler, that is) thinks it isn't optimizable beyond that...

As "every-average-assembly-programmer" can see, the compilers aren't the brightest stars in the sky. They produce average-best code, but taking any example apart by hand you can immediately see the superfluous assignments in:
Code:
```        movsd     xmm4, QWORD PTR [rsp+96]                      ;5.1
movaps    xmm5, xmm4
mulsd     xmm5, xmm4                                    ;8.13
```

EDIT:
If you arrange your code as follows:
Code:
```double CubicInterpolate(
double y0,double y1,
double y2,double y3,
double mu)
{
double a0,a1,a2;

a0 = y3 - y2 - y0 + y1;
a1 = y0 - y1 - a0;
a2 = y2 - y0;

return(a0*mu*mu*mu+a1*mu*mu+a2*mu+y1);
}
```

you'll get:
Code:
```?CubicInterpolate@@YANNNNNN@Z   PROC NEAR
; parameter 1(y0): xmm0
; parameter 2(y1): xmm1
; parameter 3(y2): xmm2
; parameter 4(y3): xmm3
; parameter 5(mu): 40 + rsp
$B1$1:                          ; Preds $B1$0
movaps    xmm4, xmm0                                    ;9.14
movsd     xmm5, QWORD PTR [rsp+40]                      ;5.1
subsd     xmm3, xmm0                                    ;8.19
subsd     xmm3, xmm2                                    ;8.14
subsd     xmm4, xmm1                                    ;9.14
subsd     xmm4, xmm3                                    ;9.19
mulsd     xmm3, xmm5                                    ;12.20
subsd     xmm2, xmm0                                    ;10.14
mulsd     xmm3, xmm5                                    ;12.38
mulsd     xmm3, xmm5                                    ;12.38
movaps    xmm0, xmm3                                    ;12.38
ret                                                     ;12.38
ALIGN     2
```

So the moral is:
LET THE COMPILER DO ITS JOB

Last edited by Madis731 on 12 Dec 2007, 09:31; edited 2 times in total
12 Dec 2007, 08:18

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
EDIT: Obsolete - look at the reviewed and fixed version below.
[---] [---]

I don't know much C, but I threw something together and ICL was glad (I don't know if it's correct...). I made it use float (single), which made it even more optimal:

[---] [---]

Last edited by Madis731 on 13 Dec 2007, 07:15; edited 1 time in total
12 Dec 2007, 09:06
DJ Mauretto

Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 12 Dec 2007, 10:55
Quote:

Lets put it to a test, what does your compiler produce?

My compiler doesn't produce anything, simply because I don't use or have
any compiler.
I ran some tests a while ago on the M$ C compiler from the free DDK package and I was
very impressed - very, very good optimisation.
My opinion remains the same: "C compilers are better than the average
asm programmer".
Of course you must know the C language and the compiler's behaviour very well,
but it's not fundamental for a common program.
12 Dec 2007, 10:55
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 12 Dec 2007, 21:39
Madis731 wrote:
then it gave me optimized assembly for 4 calculations in parallel - I think
All those instructions are scalar, meaning only a single computation is done during each loop. Your compiler cannot vectorise that code into a 4-per-loop routine; you have not given it anywhere near enough clues to do that.
12 Dec 2007, 21:39

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Oh - sorry, what should I do? ^o)
13 Dec 2007, 07:06

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Okay - sorry, I looked into it a bit:
Code:
```float CubicInterpolate(
float y0[],float y1[],
float y2[],float y3[],
float mu[])
{
float a0,a1,a2,ret[4];

for(int i=0;i<4;i++)
{
a0 = y3[i] - y2[i] - y0[i] + y1[i]; //y1-y0 => pxor a1,a1; sub a1,a0; sub a1,y1-y0
a1 = y0[i] - y1[i] - a0;
a2 = y2[i] - y0[i];
ret[i]=((a0*mu[i]+a1)*mu[i]+a2)*mu[i]+y1[i];
}

return *ret;
}
```

This code results in:
Code:
```test4.cpp
test4.cpp(8): (col. 4) remark: LOOP WAS VECTORIZED.
```

on the command line.

The assembly output looks like:
Code:
```; parameter 1(y0): rcx
; parameter 2(y1): rdx
; parameter 3(y2): r8
; parameter 4(y3): r9
; parameter 5(mu): 96 + rsp
$B1$1:                          ; Preds $B1$0
sub       rsp, 56                                       ;5.1
movaps    XMMWORD PTR [rsp+32], xmm15                   ;5.1
movups    xmm4, XMMWORD PTR [r8]                        ;10.15
movups    xmm1, XMMWORD PTR [r9]                        ;10.15
movups    xmm3, XMMWORD PTR [rcx]                       ;10.23
movups    xmm0, XMMWORD PTR [rdx]                       ;10.31
movaps    xmm2, xmm3                                    ;11.15
subps     xmm1, xmm4                                    ;10.15
mov       rax, QWORD PTR [rsp+96]                       ;5.1
movups    xmm5, XMMWORD PTR [rax]                       ;13.14
subps     xmm1, xmm3                                    ;10.23
movaps    xmm15, xmm1                                   ;13.14
mulps     xmm15, xmm5                                   ;13.14
subps     xmm2, xmm0                                    ;11.15
subps     xmm2, xmm1                                    ;11.23
mulps     xmm15, xmm5                                   ;13.24
subps     xmm4, xmm3                                    ;12.15
mulps     xmm15, xmm5                                   ;13.34
movaps    xmm0, xmm15                                   ;16.12
movaps    xmm15, XMMWORD PTR [rsp+32]                   ;16.12
ret                                                     ;16.12
```

And now we start optimizing by letting the compiler learn that there are replicated y1-y0 operations:

Code:
```...
float a0[4],a1[4],a2[4],tmp[4],ret[4];

for(int i=0;i<4;i++)
{
tmp[i]= y1[i] - y0[i];
a0[i] = y3[i] - y2[i] + tmp[i];
tmp[i]= tmp[i] + a0[i];
a2[i] = y2[i] - y0[i];
ret[i]=((a0[i]*mu[i]-tmp[i])*mu[i]+a2[i])*mu[i]+y1[i];
}
...
```

to get something like this:
Code:
```        movups    xmm1, XMMWORD PTR [rcx]                       ;10.18
movups    xmm5, XMMWORD PTR [rdx]                       ;10.18
movups    xmm3, XMMWORD PTR [r8]                        ;11.18
movups    xmm0, XMMWORD PTR [r9]                        ;11.18
movaps    xmm2, xmm5                                    ;10.18
subps     xmm0, xmm3                                    ;11.18
mov       rax, QWORD PTR [rsp+40]                       ;5.1
movups    xmm4, XMMWORD PTR [rax]                       ;15.17
subps     xmm2, xmm1                                    ;10.18
mulps     xmm0, xmm4                                    ;15.17
subps     xmm3, xmm1                                    ;14.18
subps     xmm0, xmm2                                    ;15.23
mulps     xmm0, xmm4                                    ;15.31
mulps     xmm0, xmm4                                    ;15.44
ret                                                     ;18.12
```

1) 5 unaligned MOVs are bad, but can be gotten rid of with proper alignment
2) 1 obligatory MOV to a register
3) 4 packed SUBs
4) 3 packed MULs
==
18 instructions in total (inner loop, without call/ret), of which 11 are calculation instructions (the ones that actually do the work). That should be pretty much minimal, considering the original algorithm had 12 operations. While my shuffling through tmp[] makes it more optimal with 11 operations, it cannot be implemented in fewer instructions because of the tmp[] usage scenario.
13 Dec 2007, 07:14
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 13 Dec 2007, 08:46
Madis731: Your compiler has done a reasonably good job, but the throughput can still be quadrupled (on the right processor). Think about doing more than 4 computations per loop. I would be delighted to see if a compiler can pick up on the technique and manage to fully optimise it.
13 Dec 2007, 08:46

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
You mean - give it 16 source tasks and see if it could fit them all in one loop - unroll? Of course it can, but I don't think it gains from the unroll. The reason is that all the add/subs are 1 uop and the muls are 1-2 clocks (3-4 uops). On average it would be the same with the for loop run 4 times or with a 4-times-longer for loop.

BUT, I will see...

EDIT: OK, checked - the code is the same, but there are variations: e.g.
mov rax, QWORD PTR [rsp+40]
is initialized in the prolog (r11 is used) and rax is used as the counter instead.

What IS interesting is the code you see now (when loop numbers get REALLY big) where "call __chkstk" is added:
Code:
```?CubicInterpolate@@YAMQEAM0000@Z        PROC NEAR
; parameter 1(y0): rcx
; parameter 2(y1): rdx
; parameter 3(y2): r8
; parameter 4(y3): r9
; parameter 5(mu): 160080 + rsp
$B1$1:                          ; Preds $B1$0
mov       eax, 160040                                   ;7.1
call      __chkstk                                      ;7.1
sub       rsp, 160040                                   ;7.1
xor       r10d, r10d                                    ;
mov       r11, QWORD PTR [rsp+160080]                   ;7.1
xor       eax, eax                                      ;
; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$2:                          ; Preds $B1$2 $B1$1
movups    xmm0, XMMWORD PTR [rax+rcx]                   ;12.18
movups    xmm4, XMMWORD PTR [rax+rdx]                   ;12.18
movups    xmm2, XMMWORD PTR [rax+r8]                    ;13.18
movups    xmm5, XMMWORD PTR [rax+r9]                    ;13.18
movups    xmm3, XMMWORD PTR [rax+r11]                   ;16.17
movaps    xmm1, xmm4                                    ;12.18
subps     xmm5, xmm2                                    ;13.18
subps     xmm2, xmm0                                    ;15.18
subps     xmm1, xmm0                                    ;12.18
mulps     xmm5, xmm3                                    ;16.17
subps     xmm5, xmm1                                    ;16.23
mulps     xmm5, xmm3                                    ;16.31
mulps     xmm5, xmm3                                    ;16.44
movntps   XMMWORD PTR [rsp+r10+32], xmm5                ;16.2
cmp       rax, 160000                                   ;10.4
jl        $B1$2         ; Prob 99%                      ;10.4
; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$3:                          ; Preds $B1$2
movss     xmm0, DWORD PTR [rsp+32]                      ;19.12
ret                                                     ;19.12
ALIGN     2
```

Btw, does anyone see opportunities to make Intel C++ use F32vec4 (http://download.intel.com/support/performancetools/c/linux/v9/cref_cls.pdf)
to make it "understand" what I'm trying to achieve...
13 Dec 2007, 12:12
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 13 Dec 2007, 13:25
Madis731: It is clear that your compiler did not see the opportunity to optimise this further. It leaves the dependency chain on the result register and wastes the other available registers by simply ignoring them. I think the only way to further optimise this is with assembly. This is where the compilers fall down. The simple reason being that they are not thinking programs.
13 Dec 2007, 13:25

Madis731

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
By that you mean:
1) xmm5 has a 5 (or 6) instruction long dependency just before store
2) other registers like rax rdx rcx rbx rbp rsi rdi r8..r15 xmm6..xmm15 unused
3) maybe even use different registers for a second run and interleave those two sub-functions to lessen the dependency penalty.

?

btw, I tested one "force-unroll-by-4" scenario and it gave back a ~4x longer inner loop that actually did 16 calculations in it, but I was really surprised that it made rather good interleaves there. Every next stage started loading values just before the store of the current stage, so the throughput should be better than in the code you see 2 posts back.
Then the stage-2 calculations start (73 lines altogether, which is 19*4-3, where the 3 comes from the loads saved because of only one initial load of [rax+r11]).
13 Dec 2007, 13:48
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 13 Dec 2007, 14:05
Madis731: Yes to all three points above. On a Penryn processor (not yet in the mainstream) the 4-way execution scheduler would be able to attain 4 instructions per clock cycle with the proper code. This could come at no cost to other, less capable processors running the same code.
13 Dec 2007, 14:05
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 14 Dec 2007, 13:53
BTW: the 4 instructions per clock cycle is for a single core. If you can also multi-thread the code across all 4 cores then the throughput is sixteen times that of the basic vectorised compiler code. Other issues can come into play, like memory bottlenecks and thread synchronisation, but those are very advanced topics whose only purpose is to give programmers headaches.

This would have to be a very special case of the code being run trillions of times to make the multi-threading and hand crafted assembly optimisations worth the effort in a commercial setting. But for a hobbyist this type of effort can be very rewarding for learning purposes regardless of the actual performance increase of the overall program.
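A minimal sketch of that multi-threading idea using OpenMP (the pragma placement and the function name are my assumptions, not code from this thread):

```c
#include <stddef.h>

/* Hypothetical driver: each OpenMP thread takes a contiguous chunk of
   iterations, and a vectorising compiler can still do 4 lanes per
   iteration inside each chunk. Without -fopenmp the pragma is simply
   ignored and the loop runs serially, producing identical results. */
void CubicInterpolateAll(const float *y0, const float *y1,
                         const float *y2, const float *y3,
                         const float *mu, float *out, size_t n)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        float a0 = y3[i] - y2[i] - y0[i] + y1[i];
        float a1 = y0[i] - y1[i] - a0;
        float a2 = y2[i] - y0[i];
        /* Horner form, same polynomial as the original routine */
        out[i] = ((a0 * mu[i] + a1) * mu[i] + a2) * mu[i] + y1[i];
    }
}
```

Since each output element depends only on its own inputs, the iterations are independent and the loop parallelises without synchronisation.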
14 Dec 2007, 13:53
Xorpd!

Joined: 21 Dec 2006
Posts: 161
Xorpd! 14 Dec 2007, 21:28
I've been taking a look at this thread and thought I might post a couple of FASM versions of this code. cubic1 is optimized for latency (cycle counts are estimated for Core 2 Duo) while cubic2 is optimized for fewest operations. cubic3 and cubic4 are translations to FASM of code posted earlier in this thread. One thing that is striking about the icl code is that it doesn't seem to reorder operations to minimize latency although intel's optimizer is capable of doing so. Perhaps more experimentation with compiler options is in order. In fact one of the examples above even copied a register in the middle of a long dependency chain (costs 3 cycles on Core 2 Duo) for no reason.
Code:
`    `

Even Igor seems to have lessened his hero worship of Intel's compiler technology and taken to bitching on the Intel forum about how crappy their compiler is.

I haven't done anything with the vectorized versions, for the simple reason that if you wanted to do a lot of interpolations you would most likely run a cubic spline through all your data so as to attain continuity through the second derivative, and so you wouldn't be recalculating coefficients the way this method (which is only continuous through the first derivative) does.
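The "don't recalculate coefficients" point can be sketched like this (my own illustration, not Xorpd!'s posted code): when many mu values fall between the same four samples, compute the coefficients once and make each sample a plain Horner step.

```c
/* Coefficients for one interval, computed once (hypothetical names). */
typedef struct { double a0, a1, a2, a3; } CubicSeg;

CubicSeg cubic_seg(double y0, double y1, double y2, double y3)
{
    CubicSeg s;
    s.a0 = y3 - y2 - y0 + y1;
    s.a1 = y0 - y1 - s.a0;
    s.a2 = y2 - y0;
    s.a3 = y1;
    return s;
}

/* Per-sample cost: 3 multiplies and 3 adds, no coefficient work. */
double cubic_eval(const CubicSeg *s, double mu)
{
    return ((s->a0 * mu + s->a1) * mu + s->a2) * mu + s->a3;
}
```

A spline-based scheme would precompute per-interval coefficients the same way, just derived from a global fit instead of four neighbours.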
14 Dec 2007, 21:28
DJ Mauretto

Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 15 Dec 2007, 12:02
I don't understand why you continue with float or double for a and y;
a and y are integers, only mu (alias step) is float.
Code:
```y0    DD ?
y1      DD ?
y2      DD ?
y3      DD ?
a0      DD ?
a1      DD ?
a2      DD ?
step    DD ?

mov     edx, [y0]               ; EDX = y0
mov     ecx, [y2]               ; ECX = y2
mov     esi, [y1]               ; ESI = y1
mov     edi, [y3]               ; EDI = y3
mov     eax, esi                ; EAX = y1
sub     eax, edx                ; y1 - y0
sub     eax, ecx                ; y1 - y0 - y2
add     eax, edi                ; y1 - y0 - y2 + y3
mov     [a0],eax
fild    [a0]                    ; a0
mov     edi, edx                ; EDI = y0
sub     edi, esi                ; y0 - y1
fmul    [step]                  ; a0 * step
sub     edi, eax                ; y0 - y1 - a0
mov     [a1],edi
sub     ecx, edx                ; y2 - y0
fiadd   [a1]                    ; a0 * step + a1
mov     [a2],ecx
fmul    [step]                  ; a0 * step + a1 * step
fiadd   [a2]                    ; a0 * step + a1 * step + a2
fmul    [step]                  ; a0 * step + a1 * step + a2 * step
fiadd   [y1]                    ; a0 * step + a1 * step + a2 * step + y1
fistp   [Result]                ; Have fun with C !!     ```
15 Dec 2007, 12:02
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20210
revolution 15 Dec 2007, 13:47
DJ Mauretto: How do you come to the conclusion that all the inputs are integer? The OP stated the inputs were double in the original code, and also mentioned that that may change to float.

Also, if, as you state, all inputs are actually integer and the only operations are add, sub and mul, then why do you suddenly need float for mu?

Your code is also much slower because it uses the FPU; remember the OP wanted to optimise the routine, not slow it down. Your code has many bugs and won't give the correct answer - try it if you don't believe me.
15 Dec 2007, 13:47