flat assembler
Message board for the users of flat assembler.

Shortest code to spread a WORD to SSE reg

Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 18 Oct 2019, 06:32
I'm currently evaluating the possibilities of using SSE for intro-size coding in good old real mode. I'm trying to find the shortest code to move a WORD from a general-purpose register into an SSE register and spread it across all 8 words of that register. I came up with this, but maybe there's a shorter way?
Code:
mov [si],ax           ; 2 bytes
pshuflw xmm0,[si],0   ; 5 bytes
movddup xmm0,xmm0     ; 4 bytes
I assume [si] points to a 16-byte-aligned (m128) memory location. The code above assembles to 11 bytes.
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8349
Location: Kraków, Poland
Tomasz Grysztar 18 Oct 2019, 09:26
What is your target CPU generation? With AVX2 this becomes trivial:
Code:
mov [si],ax
vpbroadcastw xmm0,[si]    
Kuemmel 18 Oct 2019, 12:39
...as far as I know, the target (FreeDOS, or any DOS) in real mode doesn't seem to support AVX even if the CPU does. I'll test again when I have access to my computer, but that's what I found when googling, IIRC.


Last edited by Kuemmel on 18 Oct 2019, 12:45; edited 1 time in total
Tomasz Grysztar 18 Oct 2019, 12:45
Oh, right, I forgot that in 16-bit code the VEX prefix works only in protected mode, not in real mode or V86.
Kuemmel 18 Oct 2019, 12:48
...I read something here about activating AVX, but it seems to involve a lot of overhead for sizecoding...
http://masm32.com/board/index.php?topic=4134.0
Tomasz Grysztar 18 Oct 2019, 12:55
And even when you enable AVX for protected mode, the VEX-prefixed instructions are still not going to work in real mode, by design.
Tomasz Grysztar 18 Oct 2019, 13:03
Anyway, since you seem to have at least SSE2, this is what comes to my mind:
Code:
mov di,si               ; 2 bytes
stosw                   ; 1 byte
stosw                   ; 1 byte
pshufd xmm0,[si],0      ; 5 bytes    
But this assumes ES=DS in addition to the assumption about SI that you provided.
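If ES=DS cannot be taken for granted, a minimal sketch of the same sequence with the segment set up explicitly follows. It costs 2 extra bytes, and it additionally assumes the direction flag is clear (DF=0), since STOSW writes to ES:[DI] and then steps DI by 2 in the direction DF selects:
Code:
push ds                 ; 1 byte
pop es                  ; 1 byte, ES=DS from here on
mov di,si               ; 2 bytes
stosw                   ; 1 byte, [es:di]=ax, then di=di+2 (DF=0 assumed)
stosw                   ; 1 byte, second copy, so dword [si] = AX:AX
pshufd xmm0,[si],0      ; 5 bytes, broadcast dword 0 to all four dwords
In a .COM program this setup is normally unnecessary, because DOS starts it with ES=DS=CS.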
Kuemmel 18 Oct 2019, 15:56
Thanks for the hint :) With shufps (which can also be used for integer data and is 1 byte shorter than pshufd) we can go down one byte, as I need to preserve DI:
Code:
push di                 ; 1 byte
mov di,si               ; 2 bytes
stosw                   ; 1 byte
stosw                   ; 1 byte
shufps xmm0,[si],0      ; 4 bytes
pop di                  ; 1 byte    
My CPU supports all SSE sets, but I couldn't find anything useful for this task in the later SSE versions...
Tomasz Grysztar 18 Oct 2019, 17:03
I'm afraid SHUFPS is not going to work here, since it can put values from the second operand only into the high portion of the destination register.
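To make this concrete, here is a sketch of what the SHUFPS variant actually computes (d0..d3 denote the four dwords of a 128-bit value, lowest first), plus one possible fix-up that is an illustration only and brings no size advantage:
Code:
shufps xmm0,[si],0      ; 4 bytes, xmm0 = old xmm0.d0, old xmm0.d0, [si].d0, [si].d0
shufps xmm0,xmm0,0AAh   ; 4 bytes, copy d2 into all four dwords
The first instruction leaves the low two dwords filled from the old contents of XMM0, so a second shuffle is needed, for 8 bytes in total versus 5 for a single PSHUFD.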

But, since you need to preserve DI, I have another idea:
Code:
push ax                 ; 1 byte
push ax                 ; 1 byte
pop dword [si]          ; 3 bytes
pshufd xmm0,[si],0      ; 5 bytes    


Last edited by Tomasz Grysztar on 18 Oct 2019, 19:10; edited 1 time in total
Kuemmel 18 Oct 2019, 17:57
Thanks! I totally forgot about the issue with shufps. Just one remark: "pop dword [si]" assembles to 3 bytes in real mode (because of the 0x66 prefix), but that's still one byte saved :)

EDIT: I did some speed profiling. The push/pop version seems a bit slower than the following, which uses the same number of bytes in total:
Code:
mov word[si],ax         ; 2 bytes
mov word[si+2],ax       ; 3 bytes
pshufd xmm0,[si],0      ; 5 bytes
bitRAKE



Joined: 21 Jul 2003
Posts: 4016
Location: vpcmpistri
bitRAKE 19 Oct 2019, 01:55
A rare corner case would allow the following code:
Code:
movd xmm0,eax
pshufb xmm0,[si]    
This assumes [SI] already points at the needed shuffle-mask constant, which is most likely not the case.
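As a sketch of what that constant would be, assuming SSSE3 is available (PSHUFB) and that the program can afford 16 bytes of data at a 16-byte-aligned address (bcast_mask is a hypothetical label, not something from the posts above): each mask byte selects one source byte, so alternating 0,1 copies the low word of EAX into all 8 words.
Code:
mov si,bcast_mask       ; only needed if SI does not already point there
movd xmm0,eax           ; low word of EAX (AX) is the value to spread
pshufb xmm0,[si]        ; result byte i = source byte selected by mask byte i

align 16                ; data, to be placed outside the execution path
bcast_mask db 0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1
The MOVD/PSHUFB pair should come to 9 bytes (4 + 5), but the 16-byte mask and the setup of SI usually cost more than they save.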
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
