flat assembler
Message board for the users of flat assembler.
sse, making use of all registers when possible.
revolution 30 Nov 2008, 01:37
Which code are you referring to?
If you want to know whether speed enhancements are "worth the hassle" then you need a utility function to determine the benefit/cost ratio. If your code is intended to run only a few times on one computer then coding for speed is usually wasted effort, unless your utility function includes such intangibles as personal learning or personal enjoyment/satisfaction. If your code is intended to run continuously on many computers and throughput is important (i.e. something that generates results at a rate that depends on program performance) then coding for speed can be of great use. But you must be careful to put the effort into the proper places in the code. Optimising one-time initialisation code is generally futile; concentrate on the inner loops of the main processing functions.
adnimo 30 Nov 2008, 02:32
actually I may have to give an example, perhaps I wasn't clear
original code:
Code:
    movups xmm0, dqword[v_left]
    movups xmm1, dqword[v_right]
    mulps xmm0, xmm1
    movups dqword[v_result], xmm0
    movups xmm0, dqword[v_left+16]
    movups xmm1, dqword[v_right+16]
    mulps xmm0, xmm1
    movups dqword[v_result+16], xmm0
    movups xmm0, dqword[v_left+32]
    movups xmm1, dqword[v_right+32]
    mulps xmm0, xmm1
    movups dqword[v_result+32], xmm0
    movups xmm0, dqword[v_left+48]
    movups xmm1, dqword[v_right+48]
    mulps xmm0, xmm1
    movups dqword[v_result+48], xmm0

and the refactored code:
Code:
    movups xmm0, dqword[v_left]
    movups xmm1, dqword[v_right]
    movups xmm2, dqword[v_left+16]
    movups xmm3, dqword[v_right+16]
    movups xmm4, dqword[v_left+32]
    movups xmm5, dqword[v_right+32]
    movups xmm6, dqword[v_left+48]
    movups xmm7, dqword[v_right+48]
    mulps xmm0, xmm1
    mulps xmm2, xmm3
    mulps xmm4, xmm5
    mulps xmm6, xmm7
    movups dqword[v_result], xmm0
    movups dqword[v_result+16], xmm2
    movups dqword[v_result+32], xmm4
    movups dqword[v_result+48], xmm6

Never mind the unaligned calls and whatnot, it's just to illustrate what I've been asking about. Would the second piece of code perform better on today's CPUs?
revolution 30 Nov 2008, 02:47
adnimo wrote: would the second piece of code perform better on today's CPUs?
If you cannot see any difference in speed while testing, then you know that efforts to make it faster may be pointless, and that is where your utility function comes in. Weigh up the benefits against the costs to see if you should continue with it.
adnimo 30 Nov 2008, 03:00
Well, I think those are just empty words... you should know, as you mention, not all CPUs are the same... so, even if I try and I get no speed increase at all, taking a decision from that point would not only be a biased move but also stupid to say the least.
Since I haven't tested on new CPUs, I don't know whether it's faster or not, hence I ask, because some other people could have tried this already...
bitRAKE 30 Nov 2008, 03:25
In general, no - it would not be faster - that code is memory bound in a major way.
revolution 30 Nov 2008, 03:54
adnimo wrote: Well, I think those are just empty words... you should know, as you mention, not all CPUs are the same... so, even if I try and I get no speed increase at all, taking a decision from that point would not only be a biased move but also stupid to say the least.
BitRAKE: What you say may not be the case. If the data is in the cache then it can run at full CPU speed. There are many factors to consider when trying to get the speed of code.
bitRAKE 30 Nov 2008, 04:12
revolution wrote: BitRAKE: What you say may not be the case. If the data is in the cache then it can run at full CPU speed. There are many factors to consider when trying to get the speed of code.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution 30 Nov 2008, 04:32
bitRAKE wrote: ... "in general".
bitRAKE 30 Nov 2008, 06:00
...and here in this thread we can only work with what he has given, which is nothing specific enough to allow any comments other than the advice you have tried to give. Yet, he wants a general answer.
asmfan 30 Nov 2008, 13:30
It will be faster: burst reads/writes and maximum parallelism on registers.
Also, there is no need to load memory into a register first: mulps xmm0, dqword[v_right] can work too.
Kuemmel 30 Nov 2008, 16:31
Hi Adnimo,
in the end you really have to write a small benchmark to measure whether this or that is slower. Breaking dependency chains by reordering code, and of course using more registers, should normally help; at least it shouldn't slow things down.
Just some other recommendations: most modern cores have several more or less specialised parallel execution units, where one may be able to do a MUL but another not, so it can sometimes also make sense to interleave the instructions in a rather unreadable way like this (it's often really a matter of trial and error, and AMDs show different behaviour than Intels):
Code:
    movups xmm0, dqword[v_left]
    movups xmm1, dqword[v_right]
    movups xmm2, dqword[v_left+16]
    mulps xmm0, xmm1
    movups xmm3, dqword[v_right+16]
    movups dqword[v_result], xmm0
    movups xmm4, dqword[v_left+32]
    mulps xmm2, xmm3
    movups xmm5, dqword[v_right+32]
    movups xmm6, dqword[v_left+48]
    mulps xmm4, xmm5
    movups dqword[v_result+16], xmm2
    movups xmm7, dqword[v_right+48]
    mulps xmm6, xmm7
    movups dqword[v_result+32], xmm4
    movups dqword[v_result+48], xmm6
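A rough sketch of the kind of small benchmark suggested above, in fasm syntax for 32-bit code. The label `bench`, the constant `ITERATIONS` and the register choices are assumptions made for illustration, not something posted in the thread; `v_left`, `v_right` and `v_result` are taken to be defined elsewhere, as in the earlier examples. RDTSC is not a serialising instruction, so the result is only approximate; running the measurement several times and keeping the lowest value tends to reduce the noise.

```asm
; Hypothetical timing harness: counts time-stamp counter ticks spent in
; ITERATIONS passes over the code under test.
; Clobbers eax, ecx, edx, esi, edi, xmm0 and xmm1.
ITERATIONS = 1000000            ; assumed value, adjust as needed

bench:
        rdtsc                   ; edx:eax = time-stamp counter at start
        mov     esi, eax        ; save the low half
        mov     edi, edx        ; save the high half
        mov     ecx, ITERATIONS
  .again:
        ; --- code under test: paste one variant here ---
        movups  xmm0, dqword [v_left]
        movups  xmm1, dqword [v_right]
        mulps   xmm0, xmm1
        movups  dqword [v_result], xmm0
        ; --- end of code under test ---
        dec     ecx
        jnz     .again
        rdtsc                   ; edx:eax = time-stamp counter at end
        sub     eax, esi
        sbb     edx, edi        ; edx:eax = elapsed ticks, loop overhead included
        ret
```

Timing each variant with the same harness and the same data keeps the loop overhead identical, so only the difference between the two results is meaningful.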
adnimo 30 Nov 2008, 17:26
I don't give details because I'm asking "in general": would arranging the calls like that be a good idea or not? That's all I ask. But if you insist, I won't be processing a lot of data on each call, just a bunch of matrices.
Regarding the examples: I can't find a difference in speed with my benchmark because my CPU is quite old... both examples ran in almost exactly the same time, and after averaging all the results I found there was no difference in speed, but that's just on this old, single-core CPU...
I know about loading memory directly into mulps, but on my testbed this causes a MAV (memory access violation). I don't know if it's because this CPU has an early SSE implementation or whether the HLL on top is messing things up.
I was wondering, however, whether placing calls like that would generally benefit speed. It's just one of those things I wonder about when I'm learning new instructions, etc. The "what if..." factor. Also, since there are 8 available registers in 32-bit mode, it made perfectly good sense to use them all whenever possible.
I know about the "premature optimization" deal and I've been reading the Intel optimization manual (slowly but surely), yet this is me learning in my spare time, so excuse any major stupidity on my side. I just find it fascinating, the whole execution speed deal. But don't worry, I always code functionality before I optimize anything.
revolution 30 Nov 2008, 18:00
adnimo: I suggest to you one small thing to try. Deliberately put in 10 or 20 NOPs and see if there is any difference in speed. If you cannot see any significant difference then I suggest that optimising will not benefit you in any practical way.
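The NOP test might look something like the fragment below, using fasm's `times` directive. The count of 16 and the position of the padding are arbitrary choices for illustration; `v_left`, `v_right` and `v_result` are assumed to be defined as in the earlier examples.

```asm
        movups  xmm0, dqword [v_left]
        movups  xmm1, dqword [v_right]
        times 16 nop            ; deliberate padding: 16 one-byte NOPs
        mulps   xmm0, xmm1
        movups  dqword [v_result], xmm0
```

If the timings with and without the `times 16 nop` line cannot be told apart, the instruction stream is unlikely to be the limiting factor in that measurement.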
|
Kuemmel 30 Nov 2008, 18:56
adnimo wrote: Regarding the examples: I can't find a difference in speed with my benchmark because my CPU is quite old... both examples ran at almost exactly the same time and after averaging all the results I found out there was no difference in speed, but that's just in this old, single core CPU...
...actually some of the optimizations that I did for my Mandelbrot bench didn't do much on old CPUs like the AMD K7 and Pentium III/IV, but did a lot on the C2D, so it might be worth running some test variations of your code on a C2D to see if there are differences... and besides the Intel manuals, make sure to read Agner Fog's manuals: http://www.agner.org/optimize/
asmfan 30 Nov 2008, 19:44
You're right about the access violation: according to the manuals, it happens if the memory source operand of mulp* isn't aligned on 16 bytes.
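A minimal sketch of the aligned case, assuming the data is declared in the fasm source itself so that `align 16` can be applied; the values and this layout are illustrative and not taken from the thread. If the arrays are allocated elsewhere (for example by a high-level language), their alignment has to be guaranteed there instead, or the `movups` load-then-multiply form from the earlier posts can be used.

```asm
; --- data (assumed layout) ---
align 16                        ; place v_left on a 16-byte boundary
v_left   dd 1.0, 2.0, 3.0, 4.0  ; 16 bytes, so v_right stays aligned
v_right  dd 5.0, 6.0, 7.0, 8.0  ; 16 bytes, so v_result stays aligned
v_result rd 4                   ; room for four single-precision floats

; --- code ---
        movaps  xmm0, dqword [v_left]   ; aligned load
        mulps   xmm0, dqword [v_right]  ; memory operand: must be 16-byte aligned
        movaps  dqword [v_result], xmm0 ; aligned store
```

This saves one register and one load instruction per multiply compared with the `movups` version, at the cost of requiring aligned data.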
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.