flat assembler
Message board for the users of flat assembler.

sse, making use of all registers when possible.

adnimo



Joined: 18 Jul 2008
Posts: 49
I've been thinking about refactoring my SSE code to make use of as many registers as possible, so that I can keep the same instructions together in groups. Would this be a good idea, especially on the newest CPUs? (I didn't find any speed difference on single-core CPUs.)

If so, do you reckon the speed benefit could be worth the hassle?
Post 29 Nov 2008, 18:57
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17669
Location: In your JS exploiting you and your system
Which code are you referring to?

If you want to know whether speed enhancements are "worth the hassle" then you need a utility function to determine the benefit/cost ratio.

If your code is intended to run only a few times on one computer then usually coding for speed is a wasted effort unless your utility function includes such intangibles as personal learning or personal enjoyment/satisfaction.

If your code is intended to run continuously on many computers and throughput is important (i.e. something that generates results at a particular rate based upon program performance) then coding for speed can be of great use. But you must be careful to ensure that effort is put into the proper places of the code. Optimising one-time initialisation code is generally futile; concentrate on the inner loops of the main processing functions.
Post 30 Nov 2008, 01:37
adnimo



Actually, I may have to give an example; perhaps I wasn't clear :(

original code:


Code:
        movups xmm0, dqword[v_left]
        movups xmm1, dqword[v_right]
        mulps xmm0, xmm1
        movups dqword[v_result], xmm0
        
        movups xmm0, dqword[v_left+16]
        movups xmm1, dqword[v_right+16]
        mulps xmm0, xmm1
        movups dqword[v_result+16], xmm0
        
        movups xmm0, dqword[v_left+32]
        movups xmm1, dqword[v_right+32]
        mulps xmm0, xmm1
        movups dqword[v_result+32], xmm0
        
        movups xmm0, dqword[v_left+48]
        movups xmm1, dqword[v_right+48]
        mulps xmm0, xmm1
        movups dqword[v_result+48], xmm0    



and the refactored code:


Code:
        movups xmm0, dqword[v_left]
        movups xmm1, dqword[v_right]
        
        movups xmm2, dqword[v_left+16]
        movups xmm3, dqword[v_right+16]
        
        movups xmm4, dqword[v_left+32]
        movups xmm5, dqword[v_right+32]
        
        movups xmm6, dqword[v_left+48]
        movups xmm7, dqword[v_right+48]
        
        mulps xmm0, xmm1
        mulps xmm2, xmm3
        mulps xmm4, xmm5
        mulps xmm6, xmm7
        
        movups dqword[v_result], xmm0
        movups dqword[v_result+16], xmm2
        movups dqword[v_result+32], xmm4
        movups dqword[v_result+48], xmm6    


Never mind the unaligned loads and whatnot; it's just to illustrate what I've been asking about.

Would the second piece of code perform better on today's CPUs?
Post 30 Nov 2008, 02:32
revolution
adnimo wrote:
Would the second piece of code perform better on today's CPUs?
The only way to know is to run it and see whether there is a difference for you. It is not possible to simply "look" at code and say it is faster or slower. Modern CPUs are extremely complex and have many features that can hide the potential delays caused by register reuse (one such feature being register renaming). So the register renaming feature of a modern CPU might completely eliminate any benefit you could obtain by manually using different registers, or perhaps not; it all depends upon a host of different factors. Memory bus usage by earlier code on the same CPU, memory bus traffic from other CPUs or, even more simply, code alignment and how each instruction fits into the 16-byte decoder window can all have a relatively large influence upon code execution speed. That is why you must ALWAYS do performance tests on live running code while trying different arrangements of the same code. There is no substitute for it; you just have to test it.

If while testing you cannot see any difference in speed then you know that efforts to get it faster may be pointless and that is where your utility function comes in. Weigh up the benefits against the costs to see if you should continue with it.
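
As a rough sketch of such a test (not from the original posts), the routine below counts time-stamp-counter ticks around a loop over the code under test. The names time_variant and ITERATIONS are made up for illustration, the code assumes 32-bit mode, and v_left, v_right and v_result are assumed to be defined as in the examples above.

Code:
ITERATIONS = 1000000                        ; hypothetical repeat count

time_variant:                               ; hypothetical name; returns elapsed ticks in EDX:EAX
        push ebx                            ; CPUID overwrites EAX, EBX, ECX and EDX
        xor eax, eax
        cpuid                               ; serialise, so earlier work cannot overlap the measurement
        rdtsc                               ; EDX:EAX = time-stamp counter at the start
        push edx
        push eax                            ; [esp] = low dword, [esp+4] = high dword

        mov ecx, ITERATIONS
  .again:
        ; ---- code under test: swap in each arrangement you want to compare ----
        movups xmm0, dqword[v_left]
        movups xmm1, dqword[v_right]
        mulps xmm0, xmm1
        movups dqword[v_result], xmm0
        ; -----------------------------------------------------------------------
        dec ecx
        jnz .again

        xor eax, eax
        cpuid                               ; wait for the loop to finish before reading the counter
        rdtsc
        sub eax, dword[esp]
        sbb edx, dword[esp+4]               ; EDX:EAX = ticks for all iterations
        add esp, 8                          ; drop the saved start value
        pop ebx
        ret

Run it several times for each arrangement and compare the lowest counts; a single run is easily disturbed by interrupts and by whatever happens to be in the cache.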
Post 30 Nov 2008, 02:47
adnimo



Well, I think those are just empty words... you should know, as you mention, not all CPUs are the same... so, even if I try and I get no speed increase at all, taking a decision from that point would not only be a biased move but also stupid to say the least.

Since I haven't tested on new CPUs, I don't know whether it's faster or not, hence I ask, because some other people could have tried this already...
Post 30 Nov 2008, 03:00
bitRAKE



Joined: 21 Jul 2003
Posts: 3045
Location: vpcmipstrm
In general, no - it would not be faster - that code is memory bound in a major way.
Post 30 Nov 2008, 03:25
revolution
adnimo wrote:
Well, I think those are just empty words... you should know, as you mention, not all CPUs are the same... so, even if I try and I get no speed increase at all, taking a decision from that point would not only be a biased move but also stupid to say the least.

Since I haven't tested on new CPUs, I don't know whether it's faster or not, hence I ask, because some other people could have tried this already...
I think you missed my point. You can't just post a snippet of code and ask others to run it for you, because it is taken out of context. If I run it here with all the test data in cache and find a good speed-up, then what? It won't help you in a real system if your data comes from the HDD.

bitRAKE: What you say may not be the case. If the data is in the cache then it can run at full CPU speed. There are many factors to consider when trying to get the speed of code.
Post 30 Nov 2008, 03:54
bitRAKE



revolution wrote:
bitRAKE: What you say may not be the case. If the data is in the cache then it can run at full CPU speed. There are many factors to consider when trying to get the speed of code.
Absolutely, that is why I said "in general". I also hope that code is not indicative of the rest. (I'm imagining some fat abstraction layer between this SSE code and some HLL.)

_________________
¯\(°_o)/¯ unlicense.org
Post 30 Nov 2008, 04:12
revolution
bitRAKE wrote:
... "in general".
But adnimo never said under what circumstances the code runs. It could be that the entire data usage footprint is <16kB and fits entirely inside the cache, or there could be an earlier stage of processing that has left the live data in the cache. I am sceptical about phrases like "in general" because of the myriad of situations that can occur to make any general case assumptions invalid.
Post 30 Nov 2008, 04:32
bitRAKE



...and here in this thread we can only work with what he has given, which is nothing specific enough to allow any comments other than the advice you have tried to give. Yet he wants a general answer.
Post 30 Nov 2008, 06:00
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
It will be faster: burst reads/writes and maximum parallelism across registers.
Also, there is no need to load memory into a register first:
mulps xmm0, dqword[v_right]
can work too.
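
Applied to the refactored code above, the memory-operand form might look like the sketch below. It is only an illustration, not code from the thread's author, and it assumes v_right is 16-byte aligned: unlike movups, a memory source operand of mulps must be aligned, otherwise the instruction faults.

Code:
        movups xmm0, dqword[v_left]
        movups xmm1, dqword[v_left+16]
        movups xmm2, dqword[v_left+32]
        movups xmm3, dqword[v_left+48]

        ; memory source operands: v_right is ASSUMED to be 16-byte aligned here
        mulps xmm0, dqword[v_right]
        mulps xmm1, dqword[v_right+16]
        mulps xmm2, dqword[v_right+32]
        mulps xmm3, dqword[v_right+48]

        movups dqword[v_result], xmm0
        movups dqword[v_result+16], xmm1
        movups dqword[v_result+32], xmm2
        movups dqword[v_result+48], xmm3

This needs only four registers instead of eight; whether it is any faster still has to be measured.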
Post 30 Nov 2008, 13:30
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Hi Adnimo,

In the end you really have to make a small benchmark to measure whether this or that is slower. Breaking dependency chains by reordering code and, of course, using more registers should normally be a benefit; at least it shouldn't slow things down.

Just some other recommendations. Most modern cores have several more or less specialized execution units working in parallel, where one may be able to do a MUL but another not, so it can sometimes also make sense to interleave the instructions in a not very readable way like this (it's often really a matter of trial and error, and AMDs behave differently from Intels):
Code:
        movups xmm0, dqword[v_left]
        movups xmm1, dqword[v_right]
        movups xmm2, dqword[v_left+16]
        mulps xmm0, xmm1
        movups xmm3, dqword[v_right+16]
        movups dqword[v_result], xmm0
        movups xmm4, dqword[v_left+32]
        mulps xmm2, xmm3
        movups xmm5, dqword[v_right+32]
        movups xmm6, dqword[v_left+48]
        mulps xmm4, xmm5
        movups dqword[v_result+16], xmm2
        movups xmm7, dqword[v_right+48]
        mulps xmm6, xmm7
        movups dqword[v_result+32], xmm4
        movups dqword[v_result+48], xmm6
Post 30 Nov 2008, 16:31
adnimo



I don't give details because I'm asking "in general": would grouping the instructions like that be a good idea or not? That's all I ask :) But if you insist, I won't be processing a lot of data on each call, just a bunch of matrices.

Regarding the examples: I can't find a difference in speed with my benchmark because my CPU is quite old... both examples ran at almost exactly the same time and after averaging all the results I found out there was no difference in speed, but that's just in this old, single core CPU...

I know about giving mulps a memory operand directly, but on my testbed this causes a MAV (memory access violation). I don't know if it's because this CPU has an early SSE implementation or perhaps the HLL on top is messing things up.

I was wondering, however, whether placing instructions like that would generally bring a speed benefit. It's just one of those things I wonder about when I'm learning new instructions, etc. The "what if..." factor.

Also, since there are 8 XMM registers available in 32-bit mode, it made perfectly good sense to use them all whenever possible.

I know about the "premature optimization" deal and I've been reading the Intel optimization manual (slowly but surely), yet this is me learning in my spare time, so excuse any major stupidity on my side. I just find it fascinating, the whole execution speed deal. But don't worry, I always code functionality before I optimize anything :P
Post 30 Nov 2008, 17:26
revolution
adnimo: I suggest to you one small thing to try. Deliberately put in 10 or 20 NOPs and see if there is any difference in speed. If you cannot see any significant difference then I suggest that optimising will not benefit you in any practical way.
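
In fasm the padding can be generated with the times directive. A sketch of the experiment on the first group of the original code; the count of 16 is arbitrary and only for illustration:

Code:
        movups xmm0, dqword[v_left]
        movups xmm1, dqword[v_right]
        times 16 nop                        ; deliberate padding: sixteen one-byte NOPs
        mulps xmm0, xmm1
        movups dqword[v_result], xmm0

Time this against the unpadded version under the same conditions. If the two results cannot be told apart, smaller changes such as register allocation are unlikely to show up either.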
Post 30 Nov 2008, 18:00
Kuemmel



adnimo wrote:
Regarding the examples: I can't find a difference in speed with my benchmark because my CPU is quite old... both examples ran at almost exactly the same time and after averaging all the results I found out there was no difference in speed, but that's just in this old, single core CPU...

...actually some of the optimizations that I did for my Mandelbrot bench didn't do much on old CPUs, like AMD K7 and Pentium III/IV, but did a lot for C2D, so it might be worth running some test variations of your code on a C2D to see if there are differences... and besides the Intel manuals, make sure to read Agner Fog's manuals: http://www.agner.org/optimize/
Post 30 Nov 2008, 18:56
asmfan



You're right about the access violation: according to the manuals, it happens if the memory source operand of mulp* isn't aligned on 16 bytes.
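
In fasm, data can be placed on a 16-byte boundary with the align directive. A minimal sketch, assuming the vectors are defined in your own assembly source and that the enclosing section is itself at least 16-byte aligned; the sizes and values are made up for illustration. If the HLL allocates the vectors, their alignment has to be arranged on that side instead.

Code:
        ; in the data section
        align 16                            ; pad up to the next 16-byte boundary
v_left   dd 1.0, 2.0, 3.0, 4.0              ; 16 bytes, so the next label stays aligned
v_right  dd 0.5, 0.5, 0.5, 0.5
v_result rd 4                               ; room for four single-precision results

        ; in the code section
        movaps xmm0, dqword[v_left]         ; aligned load
        mulps xmm0, dqword[v_right]         ; aligned memory source operand
        movaps dqword[v_result], xmm0       ; aligned store

With the data aligned like this, movaps can replace movups as well; like the memory operand of mulps, it faults if the address is not a multiple of 16.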
Post 30 Nov 2008, 19:44

Copyright © 1999-2020, Tomasz Grysztar.