flat assembler
Message board for the users of flat assembler.

Index > Main > fast cubic interpolation?

Goto page Previous  1, 2
Author
DJ Mauretto



Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
Quote:

How do you come to the conclusion that all the inputs are integer?


Have you ever tried to write to video memory at offset 1.5?
If you can, give me the name of your video card Wink
Quote:

Also, if, as you state, all inputs are actually integer and the only operations are add, sub and mul then why do you suddenly need float for mu?

Because it is the interpolation step; it's a float number.
Quote:
Your code is also much slower by using the FPU; remember the OP wanted to optimise it, not slow it down. Your code has many bugs and won't give you the correct answer. Try it if you don't believe me.

This is not my code. I found an old C compiler, and this is its optimised output.
I didn't use the SSE switch, for compatibility reasons.
It's not a bug; look at it more carefully. It seems very good to me, and
don't forget to learn more math Wink
I told you that compilers are better than the average assembly programmer Razz
Post 15 Dec 2007, 15:55
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17476
Location: In your JS exploiting you and your system
Your code will produce incorrect results. The OP clearly stated the inputs are targeted as float. The code you posted makes a very bad assumption by assuming facts not in evidence and continues blindly onwards as though the values are integer.

Another problem is that your comments do not match the computation. You would have been better off not commenting it because it only adds to the confusion.
Post 15 Dec 2007, 17:20
DJ Mauretto



Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
Quote:
Your code will produce incorrect results.

Hey man, what is your problem?
The code is right. The OP copied and pasted this cubic from the internet;
he does not know what it means. The comments are right, perhaps with missing parentheses, but the calculations are correct. Follow the code, or take paper and pen.
Post 15 Dec 2007, 17:44
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
@Xorpd!:
The optimisations were great. I sometimes thought that minimal instruction count was the first thing to look at and it generally worked, but 16 vs 12 instructions (while having 18 vs 27 clocks) is just impressive.
I also noticed the wacky use of assignment. Seems that I'm missing some compiler switches :S

@DJ Mauretto:
First of all, we're not living in DOS times anymore; all videocard shaders are written in float only. That's for the sake of ease, and YES, you CAN write to an address of 1.5 in any modern videocard because it's rounded off by the driver. The names are ATI xxxx (any), nVidia xxxx (any).

A little about SSE: Intel CPUs have had it since the Pentium III and AMD CPUs since the Athlon XP, which were both introduced around 1999. There's no need to worry about someone not having it. Anything older than that won't run newer operating systems anyway, nor run graphics normally Razz

Actually, I followed the code and it seems to yield a correct result (missing only parentheses), taking into account the integral source and results, but a few questions arise:

1) Why would anyone want to convert to integer when int<->float conversion costs so much and most graphical routines (D3D, OGL) are using floats?
2) Where did you find the integral routine? (I see mostly float ones like this: http://julien.nauroy.net/M3DDoc/class_moteur3_d_1_1_math.html#e4)
3) What if the points you give to the algorithm aren't real screen co-ordinates (like 100 and 314), but imaginary co-ordinates (as is the case with y0 and y3, which they just might be), which don't have to be integers at all? You will have major rounding errors.
4) And why do you expect it to give back integer results? You won't leave any room for example anti-aliasing if you wanna do that later.
5) The same question revolution already asked - why did you put an example of 5 times slower code than the original here? FILDs / FISTPs have 6 clock latency while SSE instructions with xmm,m128 have no latency at all.
(Taken from Agner's Core 2 instruction set tables)

And finally:
6) What do you mean by THAT sentence: "I told you that the compilers are better than the average assembly programmer" ???

What proves that? Which measurement? There have been no comparisons. How can you assume that?

Please don't make something a fact just because you think it is so!

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 15 Dec 2007, 23:29
DJ Mauretto



Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
Quote:
First of all - we're not living DOS-times anymore - all videocard shaders are written in float only. That's for the sake of ease and YES you CAN write to an address of 1.5 in any modern videocard because its rounded off by the driver. The names are ATI xxxx (any), nVidia xxxx (any).

Bad programmer assumption; I don't write code counting on someone else to
correct my computation.
Quote:
A Little about SSE - Intel CPUs have it from Pentium III and AMD CPUs have it from Athlon XP, which were both introduced around 1999. There's no need to worry about someone having it. Older than that won't run newer operating systems anyway - neither run graphics normally
I don't agree; there are people who still keep Pentium and even 486 CPUs.
Didn't older operating systems use cubic interpolation?
Quote:
1) Why would anyone want to convert to integer when int<->float conversion costs so much and most graphical routines (D3D, OGL) are using floats?

I do not think in terms of D3D or OGL; I think in terms of proper programmer education.
Quote:
2) Where did you find the integral routine?
I just sensed what he wanted to do; perhaps he could tell us more clearly what he wants to do.
Quote:
3) What if the points you give to the algorithm aren't real screen co-ordinates (like 100 and 314), but imaginary co-ordinates (as is the case with y0 and y3, which they just might be), which don't have to be integers at all? You will have major rounding errors.
Agreed; again, I just sensed what he wanted to do.
Quote:
4) And why do you expect it to give back integer results? You won't leave any room for example anti-aliasing if you wanna do that later.
see above
Quote:
5) The same question revolution already asked - why did you put an example of 5 times slower code than the original here? FILDs / FISTPs have 6 clock latency while SSE instructions with xmm,m128 have no latency at all.
First of all, this is only an example of compiler output, and for me it's good. Again, I chose the FPU for compatibility reasons; he can choose the SSE switch on his compiler.
Quote:
6) What do you mean by THAT sentence: "I told you that the compilers are better than the average assembly programmer" ???

What proves that? Which measurement? There have been no comparisons. How can you assume that?

I know that it is a difficult thing for an assembly programmer to accept, but I think so.
14 cycles instead of 20 is for me just mental masturbation for assembly programmers; you gain on one hand and lose on the other. I repeat that the most important thing is the algorithm; a few extra cycles are only a paranoia. Better to drop it as soon as possible, otherwise you may find yourself losing time unnecessarily.
Post 16 Dec 2007, 10:49
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
DJ Mauretto wrote:
14 cycles instead of 20 is for me just mental masturbation for assembly programmers


Very Happy That is actually SO TRUE, but I just love to m* on tiny clock cycle counts. Actually I get high on them. But that is what I am.

So, I understand you better now - sorry it took so long Wink

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 16 Dec 2007, 11:00
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17476
Location: In your JS exploiting you and your system
Madis731: You and me also, I spend all day wondering how to save one clock cycle in my GUI code so that when the user clicks it can respond 0.3ns faster. Forget about the fact that it is only clicked once per day, that is not important.
Post 16 Dec 2007, 11:40
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Madis731 wrote:
3) What if the points you give to the algorithm aren't real screen co-ordinates (like 100 and 314), but imaginary co-ordinates (as is the case with y0 and y3, which they just might be), which don't have to be integers at all? You will have major rounding errors.
4) And why do you expect it to give back integer results? You won't leave any room for example anti-aliasing if you wanna do that later.
Well, yeah, he made a terrible mistake by converting int to float and vice versa in the middle of an inner algorithm process.

I know DJ Mauretto was not using fixed point, but 'integers' can be used for anti-aliasing, 'fractional' points, etc. just as easily as floats, especially if we are using a limited dynamic range, like screen coords, whatever their range may be (I suspect 0...1 is best since it uses ALL the bits in a fixed-point implementation). Float is not "magically" faster than int. Whatever it does comes at a price, either in latency (which I doubt) or efficiency (performance per something, e.g. energy, but not limited to that). If this is not so, then it means Intel or AMD did a lousy job of optimizing integer operations (I fear as much) and put them at an unfair disadvantage. Smile

Floats are VERY useful for high dynamic ranges, but for screen coordinates, which span a limited domain (like 0..1), I think floats are not appropriate, and one should use fixed-point integers instead (since the computations won't vary much in the exponents).

BTW: who said the OP wanted this for graphics?


As an algorithmic overview: why did you not try to factor the polynomial? E.g.:
Code:
a*t^3 + b*t^2 + c*t + d    
where 't' is 'mu', 'a', 'b', 'c' and 'd' are the coefficients, and the '^' means raise to power. You can factor this simply as:
Code:
t*(t*(t*a+b)+c)+d    
obviously this can't easily be done in parallel, but hey, you need fewer operations without any parallelism, which is a gain! (a gain because doing a thing serially as fast as doing it in parallel is an improvement).

Note also that if you know the roots of the polynomial (i.e when the polynomial equals 0), you can factor it more efficiently like this:
Code:
(t-r1)*(t-r2)*(t-r3)*a    
where 'r1', 'r2' and 'r3' are the roots. This is better since you may be able to do two of the three multiplications in parallel, so the dependency chain is only two multiplications deep (albeit only if you want to). The problem is, if you don't have the roots, calculating them is so slow you should stop thinking about it (complex roots come in conjugate pairs, so they can be combined into real quadratic factors).

Madis731 wrote:
5) The same question revolution already asked - why did you put an example of 5 times slower code than the original here? FILDs / FISTPs have 6 clock latency while SSE instructions with xmm,m128 have no latency at all.
No latency at all? I'm surprised Cool

revolution wrote:
Madis731: You and me also, I spend all day wondering how to save one clock cycle in my GUI code so that when the user clicks it can respond 0.3ns faster. Forget about the fact that it is only clicked once per day, that is not important.
Well then we all have interesting stories to tell Smile

it's not only '0.3ns', but think of it like this: if you didn't do it, no one would've (at least if they used the same mentality as M$), and then the code wouldn't improve in this aspect -- so it was a GOOD thing, nonetheless, getting closer to the limit (when n->infinity) to perfection.. albeit we're not at n=infinity yet Smile
Post 17 Dec 2007, 14:14
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17476
Location: In your JS exploiting you and your system
The_Grey_Beast wrote:
it's not only '0.3ns', but think of it like this: if you didn't do it, no one would've (at least if they used the same mentality as M$), and then the code wouldn't improve in this aspect -- so it was a GOOD thing, nonetheless, getting closer to the limit (when n->infinity) to perfection.. albeit we're not at n=infinity yet Smile
I always viewed ns as meaning nanoseconds. I never thought about n being a variable. Interesting twist!
Post 17 Dec 2007, 14:26
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Well, I know you meant nanoseconds, but I tried to joke a bit with calculus stuff about it. I hope you got the idea Wink
Post 17 Dec 2007, 14:36
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
@The_Gray_Beast:
What he should've used is fixed point, not floating point, and that was the *real* (compiler's) mistake... I totally agree.

Actually we did this t*(t*(t*a+b)+c)+d manually, and Intel's compiler also figured it out, so I don't understand what you're getting at!? Look a few posts back; both C and ASM are given.

Memory moves (especially aligned ones) have little to no latency compared to reg,reg moves because of architectural design (ARM has high latency when doing memory operations). For example, movaps xmm0,[esp] doesn't need a register rename after a 'suspicious' operation on xmm0, whereas movaps xmm0,xmm1 does.
And by latency I mean "additional latency", not the actual micro-ops taken. They still take time, if that's what you're surprised at.

The speeds of the different ALU units are again architecture-specific, and they are related in this way: SSE_INT >= SSE_FLOAT >= INT > FLOAT, the '>' sign meaning the left is faster than the right. And IIRC Intel's FPUs are known to be slower than e.g. AMD's.
Post 17 Dec 2007, 23:43
levicki



Joined: 29 Jul 2007
Posts: 26
Location: Belgrade, Serbia
Xorpd! wrote:
Even Igor seems to have lessened his hero worship of Intel's compiler technology and taken to bitching on the Intel forum about how crappy their compiler is.


Yes, but I didn't stop at that; I submitted some issues to their support, and some of them are already fixed or at least targeted to be fixed. Without feedback they can't improve the compiler, right?

I would still take Intel Compiler over assembler any day. Besides, you said it yourself: one could use numeric optimizations to improve this on a much larger scale. Also, memory access could be optimized if we could see the "bigger picture" (i.e. the whole algorithm/program). Focusing on this function is kind of pointless. It is like trying to optimize your chainsaw engine so you can cut down trees faster when you have a bulldozer or dynamite.
Post 22 Dec 2007, 14:39
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Madis731 wrote:

Actually we did this t*(t*(t*a+b)+c)+d manually and also Intel's compiler figured it out so I don't understand where are you getting at!? Look at a few posts back, both C and ASM are given.
Because most code back then was parallel, and this can't be done in parallel. Sorry, I'll look again.

Madis731 wrote:
Memory moves (especially aligned ones) have little to none latency compared to reg,reg moves because of architectural design (ARM has high latency when doing memory operations). For example movaps xmm0,[esp] doesn't need register rename after a 'suspicious' operation on an xmm0, whereas movaps xmm0,xmm1 does.
And by latency I mean "additional latency" - not the actual micro-ops taken. They still do take time if that's what you're surprised at.
aha, now that makes sense Smile

Madis731 wrote:
The speed of different ALU units are again architecture specific and they are related in this way: SSE_INT >= SSE_FLOAT >= INT > FLOAT, the '>' sign meaning left is faster than right. And IIRC Intel's FPUs are known to be slower than i.e. AMDs.
I actually heard the opposite, that AMDs have very fast integer multiplies, but I'll consult Agner's manual just to be sure Wink
Post 22 Dec 2007, 17:32
SomeoneNew



Joined: 12 Aug 2006
Posts: 54
Thanks everyone Smile

Yes, all of my input data is float, because it's meant to interpolate object positions from snapshot to snapshot (network); given the time step (mu, from 0.0 to 1.0), one can estimate something that otherwise wouldn't be there.

The thing is, I'm also using this to draw curved lines (instead of Bezier splines, which I found too complicated for my needs), so the usage of this routine goes up by quite a lot. Imagine: I'm calling this stepping from 0.0 to 1.0 on X per data set.

I understand that cosine interpolation is faster, but its results are nowhere near those of cubic interpolation. Hence my need to speed this up, even if it's just a little bit.

Smile

PS: DJ Mauretto, chill out... I believe I understand more math than you ever will, but I'm here for ASM help and not for a pointless fight. If I pasted code from a paper (WHICH I STATED), it was because my code is OOP and differs from an abstract implementation, period. Now you see why people won't share their real code? Because some monkeys are out there crawling with anger; i.e. you, no offense though. But this is no popularity competition. In the end, if you don't provide a solution you are indeed part of the problem (and no one is forcing you to do so; since it's your call, take it like a man now).



Post 28 Dec 2007, 19:49
DJ Mauretto



Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
OK, I admit I exaggerated Smile
Post 28 Dec 2007, 20:40
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
SomeoneNew wrote:
The thing is, I'm also using this to draw curved lines (instead of bezier splines, which I found too complicated for my needs) So the usage of this routine goes up by quite a lot, imagine... I'm calling this from a 0.0 to 1.0 on X stepping per data set.
If you're going to use it like this, then you can VASTLY speed it up by evaluating it incrementally.

You can take the 3 derivatives of this polynomial and use them to compute "how much to add each step" to the value and to each of the derivatives.

Something like this (pseudo-code):

Code:
data = Compute cubic (normal way) at 0.0
d1 = 1st derivative at 0.0
d2 = 2nd derivative at 0.0
d3 = 3rd derivative at 0.0 (constant for a cubic)

; turn the derivatives into exact forward differences
; (scaling each derivative by the step alone would drift)
d1 = d1*step + d2*step^2/2 + d3*step^3/6
d2 = d2*step^2 + d3*step^3
d3 = d3*step^3

loop from 0.0 to 1.0
  data += d1
  d1 += d2
  d2 += d3

  ; do something with 'data'
end loop    
So you only have 3 additions per loop Smile
Post 02 Jan 2008, 16:07


Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.

Website powered by rwasa.