flat assembler
Message board for the users of flat assembler.
Nice way of summing xmm regs?
edfed 05 Jul 2011, 11:06
Exactly like with GP instructions: if you want to add 5 registers, do this:
Code:
add reg0,reg1
add reg2,reg3
add reg4,reg2
add reg0,reg4
I don't know why you overcomplicated your code with 2 haddps.
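For packed floats, the same tree of adds can be sketched in C with SSE intrinsics, where each `_mm_add_ps` typically compiles to one ADDPS. This is only an illustration: the function name `add5_ps` is hypothetical, and the result is still four separate lane sums, not one scalar.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps */

/* Lane-wise sum of five packed-float registers, arranged as a tree.
   add5_ps is an illustrative name, not something from the thread. */
__m128 add5_ps(__m128 r0, __m128 r1, __m128 r2, __m128 r3, __m128 r4)
{
    r0 = _mm_add_ps(r0, r1);    /* add reg0,reg1 */
    r2 = _mm_add_ps(r2, r3);    /* add reg2,reg3 (independent of the line above) */
    r4 = _mm_add_ps(r4, r2);    /* add reg4,reg2 */
    return _mm_add_ps(r0, r4);  /* add reg0,reg4 */
}
```

The first two adds do not depend on each other, so an out-of-order core can run them in parallel; lane i of the result holds the sum of lane i across all five inputs.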
Fred 05 Jul 2011, 11:57
Well, I used haddps to add together all four floats in xmm0. I can't do that with addps?
r22 05 Jul 2011, 12:05
@Fred
That's the best way. In actual code you'd probably want to interleave other, unrelated instructions to avoid reading and writing xmm0 four times in a row. Interleaving other instructions/logic lets the processor make better use of its out-of-order execution.
@edfed
Because the XMM registers hold packed data:
xmm0 = [96-127][64-95][32-63][0-31]
the horizontal add packed single (HADDPS) is needed to get the sum into a single 32-bit float.
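The two-HADDPS reduction discussed above can be sketched in C with the SSE3 intrinsic `_mm_hadd_ps`. The function name `hsum_haddps` is hypothetical, and the `target("sse3")` attribute is a GCC/Clang-specific assumption that lets the function build without passing `-msse3`.

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps (HADDPS) */

/* Sum four floats with two horizontal adds.
   hsum_haddps is an illustrative name; the attribute is GCC/Clang only. */
__attribute__((target("sse3")))
float hsum_haddps(float a, float b, float c, float d)
{
    __m128 v = _mm_setr_ps(a, b, c, d);  /* lanes, low to high: [a, b, c, d] */
    v = _mm_hadd_ps(v, v);               /* [a+b, c+d, a+b, c+d] */
    v = _mm_hadd_ps(v, v);               /* a+b+c+d in every lane */
    return _mm_cvtss_f32(v);             /* read the low 32-bit lane */
}
```

The second HADDPS depends on the result of the first, which is why the instruction's latency matters for this sequence.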
Fred 05 Jul 2011, 12:08
r22: Okay, I'll go with that. Also, thanks for the tips.
edfed 06 Jul 2011, 00:29
OK, I didn't notice the question was about adding the 4 floats inside the single xmm0 register.
I suspect there is a way to avoid using two instructions, perhaps with some three-operand instructions that do the job in fewer instructions, but if you use only 5 registers, it is pointless to hope to do better. It is possible, I think, using some very specific instruction, or by waiting for a new instruction that can add all xmm registers and components in one go. If you are in 64-bit mode, with its 16 xmm registers, finding a better way may become necessary. But here it can't be much faster. Compare with adding 4*5=20 floats using x87 code...
Madis731 06 Jul 2011, 08:31
Maybe you need to benchmark it first (if speed is crucial and you don't care about memory). I've read that the 65nm Core CPUs took 6/7 clocks + 9 latency for a single HADDPS, but they've made it better in the 45 and 32nm parts, where it only takes 3/4 + 7.
Just taking the pure numbers, the older SSE3-enabled CPUs can take 30 clocks or more to execute 2 HADDPS instructions. You can do A LOT of interleaving.
This:
Code:
mov edi,dump            ; dump = 16-byte aligned scratch buffer (movaps needs alignment)
movaps [edi],xmm0       ; spill the four floats to memory
fld dword [edi]         ; st0 = A
fadd dword [edi+4]      ; st0 = A+B
fadd dword [edi+8]      ; st0 = A+B+C
fadd dword [edi+12]     ; st0 = A+B+C+D
fstp dword [edi]        ; store the sum back to [edi]
might be faster. I haven't tested this code (and I haven't touched the FPU for a long time), so be careful copy-pasting it.
Just for the fun of it, here's a pure SSE version:
Code:
;                                LOW            HIGH
; xmm0 =                         A    B    C    D
movhlps xmm7,xmm0              ; C    D    ?    ?
addps xmm7,xmm0                ; A+C  B+D  C+?  D+?
pshufd xmm0,xmm7,10'11'00'01b  ; B+D  A+C  D+?  C+?
addss xmm0,xmm7                ; A+C+B+D ...
EDIT: code corrected!
You can use SHUFPS instead of PSHUFD, but SHUFPS takes a few clocks more. You can gain from it if the penalty from switching between FLOAT->INT->FLOAT is too great (PSHUFD is an integer instruction).
According to my calculations, adding this way will cost you 15 bytes and 15 clocks (2+4+5+4) on the same 65nm Core architecture, where 2xHADDPS takes 30 clocks. So you have a choice: twice the clocks and half the bytes (the HADDPS solution takes 8 bytes), or vice versa.
Last edited by Madis731 on 06 Jul 2011, 12:52; edited 1 time in total
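As a rough cross-check of the lane comments in the pure SSE sequence above, here is the same data flow sketched in C with SSE2 intrinsics. The function name `hsum_sse2` is hypothetical. One difference from the assembly: `_mm_movehl_ps(v, v)` fills the upper lanes with C and D instead of leaving whatever was in xmm7, which does not affect the low lane.

```c
#include <emmintrin.h>  /* SSE2: _mm_shuffle_epi32 (PSHUFD) and the cast helpers */

/* Horizontal sum via MOVHLPS / ADDPS / PSHUFD / ADDSS.
   hsum_sse2 is an illustrative name, not something from the thread. */
float hsum_sse2(float a, float b, float c, float d)
{
    __m128 v   = _mm_setr_ps(a, b, c, d);  /* low to high: [A, B, C, D] */
    __m128 hi  = _mm_movehl_ps(v, v);      /* MOVHLPS: [C, D, C, D] */
    __m128 sum = _mm_add_ps(hi, v);        /* ADDPS:   [A+C, B+D, junk, junk] */

    /* PSHUFD with 10'11'00'01b (0xB1) swaps neighbouring lanes. */
    __m128 swapped = _mm_castsi128_ps(
        _mm_shuffle_epi32(_mm_castps_si128(sum), _MM_SHUFFLE(2, 3, 0, 1)));
                                           /* [B+D, A+C, junk, junk] */

    /* ADDSS touches only the low lane: (B+D) + (A+C). */
    return _mm_cvtss_f32(_mm_add_ss(swapped, sum));
}
```

Note that the additions are grouped as (A+C) + (B+D), so for inputs that do not add exactly the rounding can differ slightly from a left-to-right x87 sum.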
Fred 06 Jul 2011, 12:29
Quote: Maybe you need to benchmark it first (if speed is crucial and you don't care about memory).
I don't really have a way of benchmarking my code; I'm not really sure how to do that. However, this program isn't really speed sensitive, I just thought my code looked a bit silly. Doing it without SSE3 would be a good idea though.
Quote:
Hmm, are you sure this is correct? Seems to me that you'd get ABD in xmm[31:0].
Quote: You can use SHUFPS instead of PSHUFD, but SHUFPS takes a few clocks more. You can gain from it if the penalty from switching between FLOAT->INT->FLOAT is too great (PSHUFD is an integer instruction).
Not sure what you're saying... should I use shufps? Then we'd avoid this float->int->float mode thing.
Madis731 06 Jul 2011, 12:39
Sorry, the correct code is:
Code:
;                                LOW            HIGH
; xmm0 =                         A    B    C    D
movhlps xmm7,xmm0              ; C    D    ?    ?
addps xmm7,xmm0                ; A+C  B+D  C+?  D+?
pshufd xmm0,xmm7,10'11'00'01b  ; B+D  A+C  D+?  C+?
addss xmm0,xmm7                ; A+C+B+D ...
and you cannot use SHUFPS because PSHUFD is the only shuffle instruction that can take 4 floats from the *same* xmm register and shuffle them. So you can forget about what I said...
...but the last instruction can be ADDPS or ADDSS, it doesn't matter, the rest are junk anyway.
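A note on the SHUFPS question: as far as the instruction encoding goes, SHUFPS accepts the same register as both operands (for example `shufps xmm1,xmm1,imm8`), in which case it permutes the lanes of that one register. The practical difference is that SHUFPS overwrites its destination, so in assembly an extra MOVAPS copy is needed where PSHUFD writes to a separate register. A minimal SSE1-only sketch in C, with the hypothetical name `hsum_sse1`:

```c
#include <xmmintrin.h>  /* SSE only: no integer-domain PSHUFD needed */

/* Horizontal sum using SHUFPS with the same source for both operands.
   hsum_sse1 is an illustrative name, not something from the thread. */
float hsum_sse1(float a, float b, float c, float d)
{
    __m128 v   = _mm_setr_ps(a, b, c, d);  /* low to high: [A, B, C, D] */
    __m128 hi  = _mm_movehl_ps(v, v);      /* MOVHLPS: [C, D, C, D] */
    __m128 sum = _mm_add_ps(v, hi);        /* [A+C, B+D, junk, junk] */

    /* SHUFPS sum,sum,0xB1: the low two result lanes come from the first
       operand, the high two from the second; both are sum here. */
    __m128 swapped = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(2, 3, 0, 1));
                                           /* [B+D, A+C, junk, junk] */

    return _mm_cvtss_f32(_mm_add_ss(sum, swapped));  /* (A+C) + (B+D) */
}
```

Whether this beats the PSHUFD version depends on the CPU; the clock counts quoted in this thread are for the 65nm Core parts, so measuring on the target machine is the safer guide.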
Fred 06 Jul 2011, 12:50
Works great! Nice use of pshufd, had forgotten about that one... stupid shufps. =(
Thanks for the help!