flat assembler
Message board for the users of flat assembler.
> Main > fast cubic interpolation?
revolution 15 Dec 2007, 17:20
Your code will produce incorrect results. The OP clearly stated that the inputs are floats. The code you posted assumes facts not in evidence and continues blindly onwards as though the values were integers.
Another problem is that your comments do not match the computation. You would have been better off not commenting it, because it only adds to the confusion.
DJ Mauretto 15 Dec 2007, 17:44
Quote: Your code will produce incorrect results.
Hey man, what is your problem? The code is right. The OP copied and pasted this cubic from the internet and does not know what it means. The comments are right; perhaps parentheses are missing, but the calculations are correct. Follow the code, or take paper and pen.
Madis731 15 Dec 2007, 23:29
@Xorpd!:
The optimisations were great. I used to think minimal instruction count was the first thing to look at, and that generally worked, but 16 vs 12 instructions (while taking 18 vs 27 clocks) is just impressive. I also noticed the whacky use of assignment. It seems I'm missing some compiler switches :S
@DJ Mauretto:
First of all, we're not living in DOS times anymore: all video-card shaders are written in float only. That is for the sake of ease, and YES, you CAN write to an address of 1.5 on any modern video card, because it is rounded off by the driver. The names are ATI xxxx (any), nVidia xxxx (any). A little about SSE: Intel CPUs have had it since the Pentium III and AMD CPUs since the Athlon XP, both introduced around 1999. There's no need to worry about someone not having it; anything older won't run newer operating systems anyway, nor run graphics properly.
Actually, I followed the code and it seems to yield a correct result (missing only parentheses), taking the integral source and results into account, but a few questions arise:
1) Why would anyone want to convert to integer when int<->float conversion costs so much and most graphical routines (D3D, OGL) use floats?
2) Where did you find the integral routine? (I see mostly float ones, like this: http://julien.nauroy.net/M3DDoc/class_moteur3_d_1_1_math.html#e4)
3) What if the points you give to the algorithm aren't real screen co-ordinates (like 100 and 314) but imaginary co-ordinates (as may well be the case with y0 and y3), which don't have to be integers at all? You will have major rounding errors.
4) And why do you expect it to give back integer results? You won't leave any room for, say, anti-aliasing if you want to do that later.
5) The same question revolution already asked: why did you post an example of code five times slower than the original here? FILD/FISTP have a 6-clock latency, while the SSE instructions with xmm,m128 operands have no extra latency at all. (Taken from Agner's Core 2 instruction set tables.)
And finally:
6) What do you mean by THAT sentence: "I told you that the compilers are better than the average assembly programmer"? What proves that? Which measurement? There have been no comparisons. How can you assume that? Please don't state something as a fact just because you think it is so!
DJ Mauretto 16 Dec 2007, 10:49
Quote: First of all - we're not living DOS-times anymore - all videocard shaders are written in float only. That's for the sake of ease and YES you CAN write to an address of 1.5 in any modern videocard because its rounded off by the driver. The names are ATI xxxx (any), nVidia xxxx (any).
Bad programmer assumption; I don't write code expecting someone else's driver to correct my computation.
Quote: A Little about SSE - Intel CPUs have it from Pentium III and AMD CPUs have it from Athlon XP, which were both introduced around 1999. There's no need to worry about someone having it. Older than that won't run newer operating systems anyway - neither run graphics normally
Did older operating systems not use cubic interpolation?
Quote: 1) Why would anyone want to convert to integer when int<->float conversion costs so much and most graphical routines (D3D, OGL) are using floats?
I don't think in terms of D3D or OGL; I think in terms of proper programmer education.
Quote: 2) Where did you find the integral routine?
Quote: 3) What if the points you give to the algorithm aren't real screen co-ordinates (like 100 and 314), but imaginary co-ordinates (as is the case with y0 and y3, which they just might be), which don't have to be integers at all? You will have major rounding errors.
Quote: 4) And why do you expect it to give back integer results? You won't leave any room for example anti-aliasing if you wanna do that later.
Quote: 5) The same question revolution already asked - why did you put an example of 5 times slower code than the original here? FILDs / FISTPs have 6 clock latency while SSE instructions with xmm,m128 have no latency at all.
Quote: 6) What do you mean with THAT sentence: "I told you that the compilers are better than the average assembly programmer" ???
I know it is a difficult thing for an assembly programmer to accept, but I think so.
14 cycles instead of 20 is, for me, just mental masturbation for assembly programmers: you gain on one hand and lose on the other. I repeat that the most important thing is the algorithm; after that, a few more cycles is only paranoia. Better to drop it as soon as possible, otherwise you may find yourself losing time unnecessarily.
Madis731 16 Dec 2007, 11:00
DJ Mauretto wrote: 14 cycle instead of 20 for me is just a mental masturbation for assembly programmers
That is actually SO TRUE, but I just love to m* on tiny clock-cycle counts. Actually, I get high on them. But that is what I am. So, I understand you better now; sorry it took so long.
revolution 16 Dec 2007, 11:40
Madis731: You and me both. I spend all day wondering how to save one clock cycle in my GUI code so that when the user clicks, it can respond 0.3 ns faster. Forget the fact that it is only clicked once per day; that is not important.
Borsuc 17 Dec 2007, 14:14
Madis731 wrote: 3) What if the points you give to the algorithm aren't real screen co-ordinates (like 100 and 314), but imaginary co-ordinates (as is the case with y0 and y3, which they just might be), which don't have to be integers at all? You will have major rounding errors.
I know DJ Mauretto was not using fixed point, but 'integers' can be used for anti-aliasing, 'fractional' points, etc. just as easily as floats, especially if we are using a limited dynamic range like screen co-ordinates, whatever their range may be (I suspect 0...1 is best, since it uses ALL the bits in a fixed-point implementation). Float is not "magically" faster than int. Whatever it does comes at a price, either in latency (which I doubt) or in efficiency (performance per something, e.g. energy, but not limited to that). If this is not so, then it means Intel or AMD did a lousy job of optimizing integer operations (I fear that) and put them at an unfair disadvantage. Floats are VERY useful for high dynamic ranges, but for screen co-ordinates, which come from a limited domain (like 0..1), I think floats are not appropriate, and one should use fixed-point integers instead (since the computations won't vary much in the exponents). BTW: who said the OP wanted this for graphics?
As an algorithmic overview: why did you not try to factor the polynomial? E.g.:
Code: a*t^3 + b*t^2 + c*t + d
becomes
Code: t*(t*(t*a+b)+c)+d
Note also that if you know the roots of the polynomial (i.e. where the polynomial equals 0), you can factor it even more efficiently, like this:
Code: (t-r1)*(t-r2)*(t-r3)*a
Madis731 wrote: 5) The same question revolution already asked - why did you put an example of 5 times slower code than the original here? FILDs / FISTPs have 6 clock latency while SSE instructions with xmm,m128 have no latency at all.
revolution wrote: Madis731: You and me also, I spend all day wondering how to save one clock cycle in my GUI code so that when the user clicks it can respond 0.3ns faster. Forget about the fact that it is only clicked once per day, that is not important.
It's not only '0.3ns'; think of it like this: if you didn't do it, no one would have (at least if they used the same mentality as M$), and then the code wouldn't improve in this aspect. So it was a GOOD thing nonetheless, getting closer to the limit (as n->infinity) of perfection, albeit we're not at n=infinity yet.
revolution 17 Dec 2007, 14:26
The_Grey_Beast wrote: it's not only '0.3ns', but think of it like this: if you didn't do it, no one would've (at least if they used the same mentality as M$), and then the code wouldn't improve in this aspect -- so it was a GOOD thing, nonetheless, getting closer to the limit (when n->infinity) to perfection.. albeit we're not at n=infinity yet
Borsuc 17 Dec 2007, 14:36
Well, I know you meant nanoseconds, but I tried to joke a bit with some calculus stuff about it. I hope you got the idea.
Madis731 17 Dec 2007, 23:43
@The_Grey_Beast:
What he should've used is fixed point, not floating point, and that was the *real* (compiler's) mistake... I totally agree. Actually, we did the t*(t*(t*a+b)+c)+d factoring manually, and Intel's compiler also figured it out, so I don't understand what you are getting at. Look a few posts back; both the C and the ASM are given.
Memory moves (especially aligned ones) have little to no latency compared to reg,reg moves because of architectural design (ARM, by contrast, has high latency on memory operations). For example, movaps xmm0,[esp] doesn't need a register rename after a 'suspicious' operation on xmm0, whereas movaps xmm0,xmm1 does. And by latency I mean "additional latency", not the actual micro-ops taken; they still take time, if that's what you're surprised at.
The speeds of the different ALU units are again architecture-specific, and they relate in this way: SSE_INT >= SSE_FLOAT >= INT > FLOAT, the '>' sign meaning the left is faster than the right. And IIRC Intel's FPUs are known to be slower than, e.g., AMD's.
levicki 22 Dec 2007, 14:39
Xorpd! wrote: Even Igor seems to have lessened his hero worship of Intel's compiler technology and taken to bitching on the Intel forum about how crappy their compiler is.
Yes, but I didn't stop at that: I submitted some issues to their support, and some of them are already fixed or at least targeted to be fixed. Without feedback they can't improve the compiler, right? I would still take the Intel compiler over assembler any day. Besides, you said it yourself: one could use numeric optimizations to improve this on a much larger scale. Also, memory access could be optimized if we could see the "bigger picture" (i.e. the whole algorithm/program). Focusing on this one function is rather pointless. It is like trying to optimize your chainsaw engine so you can cut down trees faster when you have a bulldozer or dynamite.
Borsuc 22 Dec 2007, 17:32
Madis731 wrote: Memory moves (especially aligned ones) have little to none latency compared to reg,reg moves because of architectural design (ARM has high latency when doing memory operations). For example movaps xmm0,[esp] doesn't need register rename after a 'suspicious' operation on an xmm0, whereas movaps xmm0,xmm1 does.
Madis731 wrote: The speed of different ALU units are again architecture specific and they are related in this way: SSE_INT >= SSE_FLOAT >= INT > FLOAT, the '>' sign meaning left is faster than right. And IIRC Intel's FPUs are known to be slower than i.e. AMDs.
SomeoneNew 28 Dec 2007, 19:49
Thanks everyone.
Yes, all of my input data is float, because it's meant to interpolate object positions from snapshot to snapshot (network); given the time step (mu, from 0.0 to 1.0), one can estimate something that otherwise wouldn't be there. The thing is, I'm also using this to draw curved lines (instead of Bezier splines, which I found too complicated for my needs), so the usage of this routine goes up by quite a lot. Imagine: I'm calling this stepping from 0.0 to 1.0 on X per data set. I understand that cosine interpolation is faster, but its results are nowhere near those of cubic interpolation. Hence my need to speed this up, even if it's just a little bit.
PS: DJ Mauretto, chill out... I believe I understand more math than you ever will, but I'm here for ASM help and not for a pointless fight. If I pasted code from a paper (WHICH I STATED), it was because my code is OOP and differs from an abstract implementation, period. Now you see why people won't share their real code? Because some monkeys out there are crawling with anger, i.e. you, no offense though. But this is no popularity competition. In the end, if you don't provide a solution you are indeed part of the problem (and no one is forcing you to do so; since it's your call, take it like a man now).
DJ Mauretto 28 Dec 2007, 20:40
OK, I admit I exaggerated.
Borsuc 02 Jan 2008, 16:07
SomeoneNew wrote: The thing is, I'm also using this to draw curved lines (instead of bezier splines, which I found too complicated for my needs) So the usage of this routine goes up by quite a lot, imagine... I'm calling this from a 0.0 to 1.0 on X stepping per data set.
You can take the 3 derivatives of this polynomial, and use those to compute "how much to add each step" to each of the derivatives. Something like this (pseudo-code):
Code:
data = Compute cubic (normal way) at 0.0
d1 = Compute 1st derivative
d2 = Compute 2nd derivative
d3 = Compute 3rd derivative
; multiply derivatives by step size
d1 *= step
d2 *= step
d3 *= step
loop from 0.0 to 1.0
  data += d1
  d1 += d2
  d2 += d3
  ; do something with 'data'
end loop
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.