flat assembler
Message board for the users of flat assembler.

Nice way of summing xmm regs?

Fred 05 Jul 2011, 00:27
OK, so I have 5 xmm registers, each containing 4 floats. What's a good way of adding them all together into one float?
Code:
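        addps    xmm0,xmm2    ; vertical adds: fold the five registers
        addps    xmm4,xmm6    ; (xmm0, xmm1, xmm2, xmm4, xmm6) into xmm0
        addps    xmm0,xmm1
        addps    xmm0,xmm4    ; xmm0 = four partial sums, one per lane
        haddps   xmm0,xmm0    ; horizontal add (SSE3): pairwise sums
        haddps   xmm0,xmm0    ; second pass: total in every lane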

Works, but I'm not sure it's the best solution. Anyone got a better one? I'm too tired at the moment to figure out if haddps can be used in a smarter way. Or any other instruction, for that matter, hah.

edfed 05 Jul 2011, 11:06
Exactly like with GP instructions: if you want to add 5 registers, do this:
Code:
add reg0,reg1
add reg2,reg3
add reg4,reg2
add reg0,reg4
    

I don't know why you overcomplicated your code with 2 haddps.

Fred 05 Jul 2011, 11:57
Well, I used haddps to add together all four floats in xmm0. I can't do that with addps?

r22 05 Jul 2011, 12:05
@Fred
That's the best way. In actual code you'd probably want to interleave other, unrelated instructions so that xmm0 isn't read and written four times back to back. Interleaving other instructions/logic lets the processor make better use of its out-of-order execution.

@edfed
Because the XMM registers hold packed data:
xmm0 = [96-127][64-95][32-63][0-31]
The horizontal add packed single (HADDPS) is needed to get the sum into one 32-bit single.
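
To make the lane arithmetic concrete, here is a minimal sketch in plain C that models HADDPS on a four-float array instead of executing the real instruction. The `xmm` struct and the `haddps` and `sum_lanes` helpers are names invented for this illustration. With the same register as both operands, the first HADDPS leaves the pairwise sums and the second leaves the total in every lane.

```c
/* Model of one XMM register: lane[0] is bits 0-31, lane[3] is bits 96-127. */
typedef struct { float lane[4]; } xmm;

/* Models HADDPS dst,src: adjacent pairs of dst fill the low two lanes,
   adjacent pairs of src fill the high two lanes. */
xmm haddps(xmm dst, xmm src)
{
    xmm r;
    r.lane[0] = dst.lane[0] + dst.lane[1];
    r.lane[1] = dst.lane[2] + dst.lane[3];
    r.lane[2] = src.lane[0] + src.lane[1];
    r.lane[3] = src.lane[2] + src.lane[3];
    return r;
}

/* Models "haddps xmm0,xmm0" executed twice. */
float sum_lanes(xmm x)
{
    x = haddps(x, x);   /* A+B  C+D  A+B  C+D */
    x = haddps(x, x);   /* A+B+C+D in all four lanes */
    return x.lane[0];
}
```

Calling `sum_lanes` on a register holding 1, 2, 3, 4 walks through the same two steps as the two `haddps` instructions in the original post.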

Fred 05 Jul 2011, 12:08
r22: Okay, I'll go with that. Also, thanks for the tips.

edfed 06 Jul 2011, 00:29
OK, I didn't notice that the question was about adding the 4 floats inside the single xmm0 register.

I suspect there is a way to avoid using two instructions, perhaps with some three-operand instructions that do the job in fewer instructions, but if you only use 5 registers it is pointless to hope to do much better. It is possible, I think, using some very specific instruction, or maybe by waiting for a new instruction that can add all xmm registers and components at once.

If you are in 64-bit mode, with the 16 xmm registers, finding a better way may be necessary. But here, it can't be much faster.

Compare with adding 4*5=20 floats in x87 code...

Madis731 06 Jul 2011, 08:31
Maybe you need to benchmark it first (if speed is crucial and you don't care about memory). I've read that the 65nm Core CPUs took 6/7 clocks + 9 latency for a single HADDPS, but it got better in the 45nm and 32nm parts, where it only takes 3/4 + 7.

Just taking the raw numbers, the older SSE3-enabled CPUs can take 30 clocks or more to execute 2 HADDPS instructions. You can do A LOT of interleaving in that time.

This:
Code:
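mov   edi,dump            ; dump = 16-byte scratch buffer, 16-byte aligned for movaps
movaps [edi],xmm0         ; spill all four floats to memory
fld   dword [edi]         ; st0 = A
fadd  dword [edi+4]       ; st0 = A+B
fadd  dword [edi+8]       ; st0 = A+B+C
fadd  dword [edi+12]      ; st0 = A+B+C+D
fstp  dword [edi]         ; store the sum as a single and pop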
    

might be faster. I haven't tested this code (and I haven't touched the FPU for a long time), so be careful copy-pasting it.

Just for the fun of it, here's a pure SSE version:
Code:
                                ;LOW  HIGH
;            xmm0 =             ; A B C D
movhlps xmm7,xmm0               ; C D ? ?
addps   xmm7,xmm0               ; A+C B+D C+? D+?
pshufd  xmm0,xmm7,10'11'00'01b  ; B+D A+C D+? C+?
addss   xmm0,xmm7               ; A+C+B+D ...
    

EDIT: code corrected!
You can use SHUFPS instead of PSHUFD, but SHUFPS takes a few clocks more. You can gain from it if the penalty from switching between FLOAT->INT->FLOAT is too great (PSHUFD is an integer instruction).

According to my calculations, adding this way will cost you 15 bytes and 15 clocks (2+4+5+4) on the same 65nm Core architecture, where 2xHADDPS takes 30 clocks.
So you have a choice: twice the clocks and half the bytes (the HADDPS solution takes 8 bytes), or vice versa.
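
The four-instruction sequence can be traced with a small plain-C model (a sketch only: the `xmm` struct and the helper functions are invented names that mimic each instruction on a four-float array, and nothing here measures clocks or bytes). The immediate `10'11'00'01b` is written as `0xB1`.

```c
/* Model of one XMM register: lane[0] is the LOW float, lane[3] the HIGH one. */
typedef struct { float lane[4]; } xmm;

/* MOVHLPS dst,src: high two lanes of src go to the low two lanes of dst. */
xmm movhlps(xmm dst, xmm src)
{
    dst.lane[0] = src.lane[2];
    dst.lane[1] = src.lane[3];
    return dst;               /* lanes 2 and 3 of dst are left as they were */
}

/* ADDPS dst,src: lane-by-lane add. */
xmm addps(xmm dst, xmm src)
{
    for (int i = 0; i < 4; i++)
        dst.lane[i] += src.lane[i];
    return dst;
}

/* PSHUFD dst,src,imm: result lane i is src lane ((imm >> 2*i) & 3). */
xmm pshufd(xmm src, unsigned imm)
{
    xmm r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = src.lane[(imm >> (2 * i)) & 3];
    return r;
}

/* ADDSS dst,src: add the low lane only. */
xmm addss(xmm dst, xmm src)
{
    dst.lane[0] += src.lane[0];
    return dst;
}

/* The sequence from the post; xmm7 may hold anything on entry. */
float sum_lanes_sse(xmm xmm0, xmm xmm7)
{
    xmm7 = movhlps(xmm7, xmm0);   /* C    D    ?    ?   */
    xmm7 = addps(xmm7, xmm0);     /* A+C  B+D  C+?  D+? */
    xmm0 = pshufd(xmm7, 0xB1);    /* B+D  A+C  D+?  C+? */
    xmm0 = addss(xmm0, xmm7);     /* (B+D)+(A+C) in the low lane */
    return xmm0.lane[0];
}
```

Only the low lane of the result is meaningful; the upper lanes depend on whatever xmm7 held beforehand.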


Last edited by Madis731 on 06 Jul 2011, 12:52; edited 1 time in total

Fred 06 Jul 2011, 12:29
Quote:
Maybe you need to benchmark it first (if speed is crucial and you don't care about memory).

I don't really have a way of benchmarking my code, and I'm not really sure how to do that. However, this program isn't really speed-sensitive; I just thought my code looked a bit silly. Doing it without SSE3 would be a good idea though.

Quote:
Code:
                                ;LOW  HIGH
;            xmm0 =             ; A B C D
movhlps xmm7,xmm0               ; C D ? ?
addps   xmm7,xmm0               ; A+C B+D C+? D+?
pshufd  xmm7,xmm7,10'11'00'01b  ; B+D A+C D+? C+?
addss   xmm0,xmm7               ; A+C+B+D ...     

Hmm, are you sure this is correct? Seems to me that you'd get A+B+D in xmm0[31:0].
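
That reading can be checked with a plain-C model of the quoted sequence (a sketch, not the real instructions; the `xmm` struct and helper names are invented for illustration). Because the quoted PSHUFD shuffles xmm7 in place, xmm0 still holds the plain A in its low lane when ADDSS runs.

```c
/* Model of one XMM register: lane[0] is the LOW float, lane[3] the HIGH one. */
typedef struct { float lane[4]; } xmm;

/* MOVHLPS dst,src: high two lanes of src go to the low two lanes of dst. */
xmm movhlps(xmm dst, xmm src)
{
    dst.lane[0] = src.lane[2];
    dst.lane[1] = src.lane[3];
    return dst;
}

/* ADDPS dst,src: lane-by-lane add. */
xmm addps(xmm dst, xmm src)
{
    for (int i = 0; i < 4; i++)
        dst.lane[i] += src.lane[i];
    return dst;
}

/* PSHUFD dst,src,imm: result lane i is src lane ((imm >> 2*i) & 3). */
xmm pshufd(xmm src, unsigned imm)
{
    xmm r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = src.lane[(imm >> (2 * i)) & 3];
    return r;
}

/* ADDSS dst,src: add the low lane only. */
xmm addss(xmm dst, xmm src)
{
    dst.lane[0] += src.lane[0];
    return dst;
}

/* The quoted version: PSHUFD writes back to xmm7 instead of xmm0. */
float quoted_sequence(xmm xmm0, xmm xmm7)
{
    xmm7 = movhlps(xmm7, xmm0);   /* C    D    ?    ?   */
    xmm7 = addps(xmm7, xmm0);     /* A+C  B+D  C+?  D+? */
    xmm7 = pshufd(xmm7, 0xB1);    /* B+D  A+C  D+?  C+? (in place) */
    xmm0 = addss(xmm0, xmm7);     /* A+(B+D): C never reaches the low lane */
    return xmm0.lane[0];
}
```

With A, B, C, D set to 1, 2, 4, 8 the model returns 11 (A+B+D) rather than 15.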


Quote:
You can use SHUFPS instead of PSHUFD, but SHUFPS takes a few clocks more. You can gain from it if the penalty from switching between FLOAT->INT->FLOAT is too great (PSHUFD is an integer instruction).

Not sure what you're saying... should I use shufps? Then we'd avoid this float->int->float mode thing.

Madis731 06 Jul 2011, 12:39
Sorry, the correct code is:
Code:
                                ;LOW  HIGH
;            xmm0 =             ; A B C D
movhlps xmm7,xmm0               ; C D ? ?
addps   xmm7,xmm0               ; A+C B+D C+? D+?
pshufd  xmm0,xmm7,10'11'00'01b  ; B+D A+C D+? C+?
addss   xmm0,xmm7               ; A+C+B+D ... 
    

and SHUFPS is not a drop-in replacement here: PSHUFD can read xmm7 and write the shuffled copy into xmm0 in one instruction, while SHUFPS always overwrites its destination, so it would need an extra MOVAPS first. So you can forget about what I said...

...but the last instruction can be ADDPS or ADDSS, it doesn't matter, since the other lanes are junk anyway.
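
A note on SHUFPS, going by the instruction definitions: its low two result lanes are picked from the destination and its high two from the source, so naming the same register for both operands does permute within one register. What it lacks is a separate destination, so an extra MOVAPS is needed to keep the unshuffled copy. The sketch below models that variant in plain C; the `xmm` struct and helper names are invented for illustration, and no timing claim is implied.

```c
/* Model of one XMM register: lane[0] is the LOW float, lane[3] the HIGH one. */
typedef struct { float lane[4]; } xmm;

/* MOVHLPS dst,src: high two lanes of src go to the low two lanes of dst. */
xmm movhlps(xmm dst, xmm src)
{
    dst.lane[0] = src.lane[2];
    dst.lane[1] = src.lane[3];
    return dst;
}

/* ADDPS dst,src: lane-by-lane add. */
xmm addps(xmm dst, xmm src)
{
    for (int i = 0; i < 4; i++)
        dst.lane[i] += src.lane[i];
    return dst;
}

/* SHUFPS dst,src,imm: low two lanes are selected from dst,
   high two lanes are selected from src. */
xmm shufps(xmm dst, xmm src, unsigned imm)
{
    xmm r;
    r.lane[0] = dst.lane[imm & 3];
    r.lane[1] = dst.lane[(imm >> 2) & 3];
    r.lane[2] = src.lane[(imm >> 4) & 3];
    r.lane[3] = src.lane[(imm >> 6) & 3];
    return r;
}

/* ADDSS dst,src: add the low lane only. */
xmm addss(xmm dst, xmm src)
{
    dst.lane[0] += src.lane[0];
    return dst;
}

/* Same reduction with MOVAPS + SHUFPS standing in for PSHUFD. */
float sum_lanes_shufps(xmm xmm0, xmm xmm7)
{
    xmm7 = movhlps(xmm7, xmm0);        /* C    D    ?    ?   */
    xmm7 = addps(xmm7, xmm0);          /* A+C  B+D  C+?  D+? */
    xmm0 = xmm7;                       /* movaps xmm0,xmm7 (the extra copy) */
    xmm0 = shufps(xmm0, xmm0, 0xB1);   /* B+D  A+C  D+?  C+? */
    xmm0 = addss(xmm0, xmm7);          /* (B+D)+(A+C) in the low lane */
    return xmm0.lane[0];
}
```

The trade is one more instruction in exchange for staying in the floating-point domain; whether that pays off would have to be measured on the target CPU.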

Fred 06 Jul 2011, 12:50
Works great! Nice use of pshufd, had forgotten about that one... stupid shufps. =(
Thanks for the help!
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
