flat assembler
Message board for the users of flat assembler.

Index > Main > simple sse question

tthsqe



Joined: 20 May 2009
Posts: 724
Is there an instruction that broadcasts a single-precision float from memory to all four slots of an xmm register (similar to movddup for doubles)?
Currently using:
movss xmm0, dword []
shufps xmm0,xmm0,0
Post 28 Feb 2010, 09:55
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
Code:
VBROADCASTSS xmm1,[m32]    
Although you will have to wait a while for the instruction to be available. Wink
Post 28 Feb 2010, 11:03
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
If you don't want to wait for the later CPUs to be released you can have it now with:
Code:
vld1.32 {d0[],d1[]},[r0]    
Loads a single 128-bit vector with four duplications of a single 32-bit memory value.

although you will need an ARM CPU for it to work Razz
Post 28 Feb 2010, 11:40
tthsqe



Joined: 20 May 2009
Posts: 724
OK, I get it. It is not possible right now. Ha - is ARM ahead of x86?
Post 28 Feb 2010, 20:15
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
tthsqe wrote:
is ARM ahead of x86?
They are on different race tracks. So trying to judge who is ahead of whom is like trying to decide whether the velodrome riders are ahead of the pool swimmers.
Post 01 Mar 2010, 01:50
ass0



Joined: 31 Dec 2008
Posts: 521
Location: ( . Y . )
Anyway, you are implying that one is evolving faster than the other...

Post 01 Mar 2010, 01:52
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Of course there is:
shufps xmm0, dword [],0
pshufd xmm0, dword [],0

Why wouldn't you use them?


It looks just like you are looking for an IMUL4 eax shortcut instruction when you already have SHL eax,2 Smile I think it's really fast.
Post 01 Mar 2010, 07:44
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
Quote:

Why wouldn't you use them?

Because the operand size in your code is not really available? Wink
Post 01 Mar 2010, 16:37
baldr



Joined: 19 Mar 2008
Posts: 1651
tthsqe,

A single pshufd xmmx, dqword [mem32], 0 would suffice, if you don't mind a #GP when mem32 is not properly aligned.
Post 01 Mar 2010, 23:22
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
BTW, besides the problem pshufd has regarding memory alignment (which makes its use with float arrays impossible), and its extra requirement of SSE2, could it incur some performance hit? pshufd will probably mark the two halves of the xmm register as INT, so the next floating-point operation MAY be penalized for that, no?
Post 02 Mar 2010, 06:17
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
If you want to load unaligned dwords, you can do this:
Code:
pshufd xmm0,[mem32],00000000b ;first dword
pshufd xmm0,[mem32],01010101b ;second dword
pshufd xmm0,[mem32],10101010b ;---
pshufd xmm0,[mem32],11111111b ;last (4th)
    

I know they're immediates, but there's always a way in your code to determine where you load your data.
pshufd will switch to INT indeed, and you pay a clock for that, but in my experience it's too small to notice. The MOVSS intrinsic is not encouraged by Intel; they say the use of MOVPS/MOVPD is better in this case.

And if you don't want to shuffle between INT/FPU, you can always do the all-INT way (and still use MOV*):
Code:
movdqa xmm0,[mem32]
pshufd xmm0,xmm0,0
    

but I don't see it beating pshufd xmm0,[mem32],0 in speed nor size.
Post 02 Mar 2010, 07:14
tthsqe



Joined: 20 May 2009
Posts: 724
Silly me for thinking that the integer version
Code:
pshufd xmm0,[mem32],0    

of
Code:
shufps xmm0,[mem32],0    

would shuffle them the same way. I just assumed they would be consistent. Mad

I think I'll accept the fourfold increase in code size and go with that last one by Madis.
Post 02 Mar 2010, 08:03
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Oh dear - of course - shufps takes BOTH inputs and shuffles them. That is why I always use the packed one. Oops!

Who's the bad boy here Intel? AMD? both?
Post 02 Mar 2010, 08:42
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
Madis731 wrote:
Who's the bad boy here Intel? AMD? both?
I expect it is a hangover from the time when all SSE was still done with 64-bit operations internally.
Post 02 Mar 2010, 08:51
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
Quote:
I know they're immediates, but there's always a way in your code to determine where you load your data.
Like loop unrolling, something that I hadn't even slightly thought about when I talked about its inability to be used with float arrays Embarassed
Post 02 Mar 2010, 13:45
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
This simple SSE question has grown out of hand Smile but there are some not-so-simple ideas for unrolling even float arrays.

You can first load xmm0 with movdqa const[1.0,2.0,3.0,4.0], then add const[4.0,4.0,4.0,4.0] to this register every loop.

Now you can effectively use these constants to calculate linearly every number you want. Actually, the Intel C Compiler will optimize float loop counters all by itself, though it prefers integer loop indices.
Post 02 Mar 2010, 18:05


Copyright © 1999-2020, Tomasz Grysztar.

Powered by rwasa.