flat assembler
Message board for the users of flat assembler.
revolution
Code: VBROADCASTSS xmm1,[m32]
|||
revolution
If you don't want to wait for the later CPUs to be released you can have it now with:
Code: vld1.32 {d0[],d1[]},[r0]
although you will need an ARM CPU for it to work.
|||
tthsqe
Ok, I get it. It is not possible right now. Ha - is arm ahead of x86?
|||
revolution
tthsqe wrote: is arm ahead of x86?
|||
ass0
Anyway, you are implying that some are evolving faster than others...
|||
Madis731
Of course there is:
Code:
    shufps xmm0, dword [],0
    pshufd xmm0, dword [],0
Why wouldn't you use them? It looks like you are looking for an IMUL4 eax shortcut instruction when you already have SHL eax,2.
|||
LocoDelAssembly
Quote:
Because the operand size in your code is not really available?
|||
baldr
tthsqe,
A single pshufd xmmx, dqword [mem32], 0 would suffice, if you don't mind a #GP when mem32 is not properly aligned.
|||
LocoDelAssembly
BTW, besides the problem pshufd has regarding memory alignment (which makes its use with float arrays impossible), and its extra requirement for SSE2, could it incur some performance hit? pshufd will probably mark the two halves of the xmm register as INT, so the next floating point operation MAY be penalized for that, no?
|||
Madis731
If you want to load unaligned dwords, you can do this:
Code:
    pshufd xmm0,[mem32],00000000b ;first dword
    pshufd xmm0,[mem32],01010101b ;second dword
    pshufd xmm0,[mem32],10101010b ;third dword
    pshufd xmm0,[mem32],11111111b ;last (4th)
I know they're immediates, but there's always a way in your code to determine where you load your data. pshufd will indeed switch to INT and you pay a clock for that, but in my experience it's too small to notice. The MOVSS intrinsic is not encouraged by Intel, and they say the use of MOVPS/MOVPD is better in this case. And if you don't want to shuffle between INT/FPU, you can always do it the all-INT way (and still use MOV*):
Code:
    movdqa xmm0,[mem32]
    pshufd xmm0,xmm0,0
but I don't see it beating pshufd xmm0,[mem32],0 in speed or size.
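To make the four immediates concrete, here is a small Python model of how PSHUFD selects lanes. It is only a sketch, not assembly: the function name `pshufd` and the sample values are made up for illustration, and a list stands in for the 128-bit operand with lane 0 first. The model ignores the fact that a memory operand is read as a full 16-byte block that must be aligned.

```python
def pshufd(src, imm8):
    """Model of PSHUFD: result lane i is src[(imm8 >> 2*i) & 3]."""
    if len(src) != 4:
        raise ValueError("expected four dword lanes")
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


# Hypothetical contents of the 16 bytes at [mem32], lane 0 first.
block = [10, 20, 30, 40]

first = pshufd(block, 0b00000000)   # every 2-bit field selects lane 0
second = pshufd(block, 0b01010101)  # every 2-bit field selects lane 1
third = pshufd(block, 0b10101010)   # every 2-bit field selects lane 2
fourth = pshufd(block, 0b11111111)  # every 2-bit field selects lane 3
```

Repeating the same 2-bit index in all four fields is what turns the shuffle into a broadcast of one dword.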
|||
tthsqe
Silly me for thinking that the integer version
Code: pshufd xmm0,[mem32],0
of
Code: shufps xmm0,[mem32],0
would shuffle them the same way. I just assumed they would be consistent. I think I'll accept the fourfold increase in code size and go with that last one by Madis.
|||
Madis731
Oh dear - of course - shufps will take BOTH inputs and shuffle them. That is why I always use the packed one. Oops!
Who's the bad boy here: Intel? AMD? Both?
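The difference between the two shuffles can be sketched with a small Python model. The helper names and sample values are hypothetical, and lists stand in for registers with lane 0 first. SHUFPS takes its two low result lanes from the destination and its two high lanes from the source, whereas PSHUFD takes all four from the source.

```python
def pshufd(src, imm8):
    """Model of PSHUFD: all four result lanes come from src."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


def shufps(dest, src, imm8):
    """Model of SHUFPS: lanes 0-1 come from dest, lanes 2-3 from src."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [dest[sel[0]], dest[sel[1]], src[sel[2]], src[sel[3]]]


old_xmm0 = [9.0, 9.0, 9.0, 9.0]  # hypothetical stale register contents
mem = [1.5, 2.5, 3.5, 4.5]       # hypothetical 16 bytes at [mem32]

int_result = pshufd(mem, 0)           # broadcast of mem lane 0
fp_result = shufps(old_xmm0, mem, 0)  # low half still comes from old xmm0

# With the same register as both operands, SHUFPS does broadcast lane 0,
# for example after a scalar load has placed the value there.
loaded = [1.5, 0.0, 0.0, 0.0]
broadcast = shufps(loaded, loaded, 0)
```

In this model only the register-to-itself form of SHUFPS gives the same result as the PSHUFD broadcast, which matches the two-step load-then-shuffle approach settled on above.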
|||
revolution
Madis731 wrote: Who's the bad boy here Intel? AMD? both?
|||
LocoDelAssembly
Quote: I know they're immediates, but there's always a way in your code to determine where you load your data.
|||
Madis731
This simple SSE question has gotten out of hand.
You can first load xmm0 with movdqa const[1.0,2.0,3.0,4.0], then add const[4.0,4.0,4.0,4.0] to this register every loop. Now you can effectively use these constants to calculate linearly every number you want. Actually, the Intel C Compiler will optimize float loop counters all by itself. Of course, it prefers integer loop indexes.
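As a rough illustration of the running-index idea, the following Python sketch models a register that starts at [1.0, 2.0, 3.0, 4.0] and has [4.0, 4.0, 4.0, 4.0] added on every pass. The names `addps` and `lane_indices` are made up for this example, and lists stand in for xmm registers.

```python
def addps(a, b):
    """Model of a packed single-precision add: lane-wise a + b."""
    return [x + y for x, y in zip(a, b)]


def lane_indices(iterations):
    """Return the index vector seen on each loop pass."""
    idx = [1.0, 2.0, 3.0, 4.0]   # loaded once before the loop
    step = [4.0, 4.0, 4.0, 4.0]  # constant added every pass
    seen = []
    for _ in range(iterations):
        seen.append(idx)
        idx = addps(idx, step)   # builds a new list, so earlier entries stay intact
    return seen


passes = lane_indices(3)
```

Each pass then has four consecutive float indexes available without any per-element conversion or broadcast.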
Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.