flat assembler
Message board for the users of flat assembler.

Adding up individual floating point values in a xmm register




Raedwulf 02 Apr 2006, 06:12
Hi:

I was wondering what the fastest way is to add up the elements of a single xmm register.
For example, an xmm register contains 4 single-precision floating-point values:
x3,x2,x1,x0
How would you get the result of x3+x2+x1+x0?
If I recall correctly, SSE3 has a faster way of doing this 'horizontal' addition. But I need the code in SSE (not SSE2, SSE3, etc.).

Cheers



revolution 02 Apr 2006, 07:23
Assuming your source register is xmm0
Code:
movhlps xmm1,xmm0
addps xmm0,xmm1
shufps xmm1,xmm0,1
addps xmm0,xmm1    



Madis731 02 Apr 2006, 09:49
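revolution, you have an error in your code. The result of a shuffle gets defined by BOTH operands and you can't get the source's lower into destination's lower. With SHUFPS, the two low dwords of the result are selected from the destination operand and the two high dwords from the source operand.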

Give 1.0 2.0 3.0 4.0 IN:
Code:
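        movhlps xmm1,xmm0       ; xmm1 low half = x3,x2
        addps   xmm0,xmm1       ; xmm0 low half = x1+x3, x0+x2
        movaps  xmm1,xmm0       ; copy the partial sums
        shufps  xmm1,xmm1,1     ; move x1+x3 into the low dword of xmm1
        addps   xmm0,xmm1       ; low dword = x0+x1+x2+x3; or even addss because it's 1 µop faster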

and get X X X 10.0 OUT
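
To see why the original SHUFPS sequence goes wrong, the lane movements can be modelled outside the CPU. What follows is only a rough Python sketch, not SSE code: registers are plain 4-element lists with index 0 as the low dword, the helper names (`movhlps`, `addps`, `shufps`, `first_attempt`, `fixed`) are made up for this illustration, Python floats are doubles rather than singles, and the undefined initial contents of xmm1 are assumed to be zeros.

```python
# Lane-level model of three SSE instructions. Index 0 is the low dword.
# Helper names are hypothetical; they only mirror the mnemonics.

def movhlps(dst, src):
    # Low two dwords of dst come from the high two dwords of src;
    # the high two dwords of dst are left unchanged.
    return [src[2], src[3], dst[2], dst[3]]

def addps(dst, src):
    return [a + b for a, b in zip(dst, src)]

def shufps(dst, src, imm):
    # Result dwords 0 and 1 are picked from dst, dwords 2 and 3 from src.
    return [dst[imm & 3], dst[(imm >> 2) & 3],
            src[(imm >> 4) & 3], src[(imm >> 6) & 3]]

def first_attempt(x):
    xmm0, xmm1 = list(x), [0.0] * 4      # xmm1 contents assumed zero
    xmm1 = movhlps(xmm1, xmm0)
    xmm0 = addps(xmm0, xmm1)
    xmm1 = shufps(xmm1, xmm0, 1)         # low dword comes from xmm1, not xmm0
    return addps(xmm0, xmm1)

def fixed(x):
    xmm0, xmm1 = list(x), [0.0] * 4      # xmm1 contents assumed zero
    xmm1 = movhlps(xmm1, xmm0)
    xmm0 = addps(xmm0, xmm1)
    xmm1 = list(xmm0)                    # movaps xmm1,xmm0
    xmm1 = shufps(xmm1, xmm1, 1)
    return addps(xmm0, xmm1)

values = [1.0, 2.0, 3.0, 4.0]            # x0, x1, x2, x3
print(first_attempt(values)[0])          # 8.0  (x0+x2+x3, x1 is missing)
print(fixed(values)[0])                  # 10.0 (x0+x1+x2+x3)
```

In the first sequence the low dword of the shuffled xmm1 is taken from xmm1 itself (which still holds x3 there), so x1 never reaches the low dword of the sum.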

Give 1.0 2.0 3.0 4.0 IN:
Code:
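        movaps  xmm1,xmm0
        shufps  xmm1,xmm1,00011011b     ; reverse the dword order
        addps   xmm0,xmm1               ; x3+x0, x2+x1, x1+x2, x0+x3
        movaps  xmm1,xmm0
        shufps  xmm1,xmm1,01001110b     ; swap the high and low halves
        addps   xmm0,xmm1               ; all four dwords = x0+x1+x2+x3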

and get 10.0 10.0 10.0 10.0 OUT

Maybe useful sometimes. The result in XMM0 is ready at the start of the 11th and 13th clock cycle, respectively.

:) I looked at SSE3 and it really is simpler to implement, in terms of instruction count:
Code:
        haddps  xmm0,xmm0
        haddps  xmm0,xmm0
    

The advantage is that you get all four DWORDs filled with the same result.
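
The two "sum in every dword" variants can be compared with a small lane-level model. This is a hedged Python sketch rather than SSE code: registers are 4-element lists with index 0 as the low dword, the helper names (`shufps`, `addps`, `haddps`, `broadcast_sum_sse`, `broadcast_sum_sse3`) are invented for the illustration, and Python floats are doubles, so single-precision rounding is not modelled.

```python
# Lane-level model; index 0 is the low dword. Helper names are hypothetical.

def addps(dst, src):
    return [a + b for a, b in zip(dst, src)]

def shufps(dst, src, imm):
    # Result dwords 0 and 1 are picked from dst, dwords 2 and 3 from src.
    return [dst[imm & 3], dst[(imm >> 2) & 3],
            src[(imm >> 4) & 3], src[(imm >> 6) & 3]]

def haddps(dst, src):
    # SSE3: adjacent pairs of dst fill the low half, pairs of src the high half.
    return [dst[0] + dst[1], dst[2] + dst[3],
            src[0] + src[1], src[2] + src[3]]

def broadcast_sum_sse(x):
    xmm0 = list(x)
    xmm1 = shufps(xmm0, xmm0, 0b00011011)   # movaps + shufps: reversed copy
    xmm0 = addps(xmm0, xmm1)
    xmm1 = shufps(xmm0, xmm0, 0b01001110)   # movaps + shufps: halves swapped
    return addps(xmm0, xmm1)

def broadcast_sum_sse3(x):
    xmm0 = haddps(list(x), list(x))
    return haddps(xmm0, xmm0)

values = [1.0, 2.0, 3.0, 4.0]
print(broadcast_sum_sse(values))     # [10.0, 10.0, 10.0, 10.0]
print(broadcast_sum_sse3(values))    # [10.0, 10.0, 10.0, 10.0]
```

Note that the two variants add the elements in a different order, so with real single-precision data the results may differ in the last bits.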



Raedwulf 02 Apr 2006, 11:49
Thanks!

This really helps :) I wrote some really crappy SSE code to add them up, lol, using too many SHUFPS, because my head really gets spun round trying to work out which goes where.

By the way, what debugger do you guys use? I use OllyDbg, but as it doesn't support XMM registers, I have to store the values in temporary memory locations to view their contents while debugging.

Cheers!




Raedwulf 02 Apr 2006, 12:04
Yeah, SSE3 is bloody useful in these situations. Unfortunately my desktop doesn't even support SSE2! Any willing donors for my humble cause are appreciated :)
LOL
Cheers!


revolution 02 Apr 2006, 13:34
Quote:
revolution, you have an error in your code. The result of a shuffle gets defined by BOTH operands and you can't get the source's lower into destination's lower.
Oh yeah, you are right, thanks for that. I meant to type PSHUFD but got confused with shufps, so my code should be this:
Code:
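movhlps xmm1,xmm0       ; xmm1 low half = x3,x2
addps xmm0,xmm1         ; xmm0 low half = x1+x3, x0+x2
pshufd xmm1,xmm0,1      ; low dword of xmm1 = x1+x3 (PSHUFD is SSE2)
addps xmm0,xmm1         ; low dword of xmm0 = x0+x1+x2+x3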
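
The difference between PSHUFD and SHUFPS can be checked with a small lane-level model. This is a minimal Python sketch under stated assumptions, not SSE code: registers are 4-element lists with index 0 as the low dword, the helper names are hypothetical, Python floats are doubles rather than singles, and the undefined initial contents of xmm1 are assumed to be zeros.

```python
# Lane-level model; index 0 is the low dword. Helper names are hypothetical.

def movhlps(dst, src):
    # Low two dwords of dst come from the high two dwords of src.
    return [src[2], src[3], dst[2], dst[3]]

def addps(dst, src):
    return [a + b for a, b in zip(dst, src)]

def shufps(dst, src, imm):
    # Result dwords 0 and 1 are picked from dst, dwords 2 and 3 from src.
    return [dst[imm & 3], dst[(imm >> 2) & 3],
            src[(imm >> 4) & 3], src[(imm >> 6) & 3]]

def pshufd(src, imm):
    # SSE2: all four result dwords are picked from the source operand.
    return [src[(imm >> (2 * i)) & 3] for i in range(4)]

def hsum_pshufd(x):
    xmm0, xmm1 = list(x), [0.0] * 4     # xmm1 contents assumed zero
    xmm1 = movhlps(xmm1, xmm0)
    xmm0 = addps(xmm0, xmm1)
    xmm1 = pshufd(xmm0, 1)
    return addps(xmm0, xmm1)

def same_as_copy_then_shufps(x):
    # True if PSHUFD matches MOVAPS + SHUFPS reg,reg for every immediate.
    return all(pshufd(x, imm) == shufps(list(x), list(x), imm)
               for imm in range(256))

values = [1.0, 2.0, 3.0, 4.0]
print(hsum_pshufd(values)[0])            # 10.0
print(same_as_copy_then_shufps(values))  # True
```

In this model PSHUFD behaves like a register copy followed by SHUFPS of the copy with itself, which is why it saves one instruction where SSE2 is available.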



Raedwulf 02 Apr 2006, 16:38
Thanks revolution :)
I'll make a note of that, because it appears to be faster.
However, it uses SSE2 (PSHUFD), and my desktop is antique and doesn't support it :(

Cheers anyway. I'll add it to my program once I have better access to an SSE2 computer.




Reverend 26 May 2006, 16:06
Raedwulf: OllyDbg can show SSE registers. Just go to 'Debugging options' -> 'Registers' and check 'Decode SSE registers' :). But there is a bug in OllyDbg which I found and have already reported to Oleh Yuschuk. When you enable decoding of these registers, the debugger won't catch some exceptions. Most of the time it isn't even noticeable, but I was once tracing something and was furious that it didn't catch exceptions! The author himself admitted it's a bug, but as version 1.10d is no longer being developed, it remains unfixed.
