flat assembler
Message board for the users of flat assembler.

Implementing SFMD functions the RISC way
Author
mattst88

Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88
I'm learning Alpha assembly, and in keeping with the RISC way of doing things, there are only very basic SIMD instructions: minimum and maximum of signed/unsigned words/bytes. That is, {min,max}{u,s}{b,w}

Whereas SSE provides instructions for everything and walks your dog for you, complicated instructions have to be built from basic instructions on the Alpha.

For instance, the SSE/MMX instructions paddusb, psubusb can be implemented as

Code:
```static __inline uint64_t
__paddusb8(uint64_t m1, uint64_t m2) {
    return m1 + __minub8(m2, ~m1);
}

static __inline uint64_t
__psubusb8(uint64_t m1, uint64_t m2) {
    return m1 - __minub8(m2, m1);
}    ```
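
For readers without an Alpha toolchain, the two helpers above can be tried with a portable stand-in for the MINUB8 operation. This is only a sketch: `minub8_ref` is a hypothetical byte-by-byte replacement for `__minub8`, and the function names drop the leading underscores to avoid reserved identifiers.

```c
#include <stdint.h>

/* Hypothetical portable stand-in for Alpha's MINUB8: unsigned minimum
   of each of the eight byte lanes. */
static inline uint64_t minub8_ref(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i += 8) {
        uint64_t x = (a >> i) & 0xFF;
        uint64_t y = (b >> i) & 0xFF;
        r |= (x < y ? x : y) << i;
    }
    return r;
}

/* Saturating unsigned byte add: each lane adds at most 255 - m1, so no
   lane can carry into its neighbour. */
static inline uint64_t paddusb8(uint64_t m1, uint64_t m2) {
    return m1 + minub8_ref(m2, ~m1);
}

/* Saturating unsigned byte subtract: each lane subtracts at most m1,
   so no lane can borrow from its neighbour. */
static inline uint64_t psubusb8(uint64_t m1, uint64_t m2) {
    return m1 - minub8_ref(m2, m1);
}
```

With these definitions, `paddusb8(0x80FF, 0x0101)` gives `0x81FF` (the low byte saturates at 0xFF) and `psubusb8(0x0500, 0x0201)` gives `0x0300` (the low byte saturates at 0x00).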

When I saw these, I thought they were kind of clever. Another interesting one (that I have not confirmed as working) is paddw:

Code:
```__paddw(uint64_t m1, uint64_t m2) {
uint64_t signs = (m1 ^ m2) & 0x8000800080008000;
m1 &= ~signs;
m2 &= ~signs;
m1 += m2;
m1 ^= signs;
return m1;
}    ```

Here's an exercise for you: Using only the following instructions implement an SIMD instruction (SSE/MMX) that operates on byte or word integers.

Available instructions: and, or, xor, not, SIMD min/max for bytes/words, addition/subtraction. Note that the available add/subtract instructions are not SIMD aware and treat operands as single integers.
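
One possible answer, offered as a sketch rather than a definitive solution: a packed 16-bit wrapping subtract (in the spirit of psubw) built from or, and, xor, not and a single plain 64-bit subtraction. The names `psubw4` and `SIGN_BITS16` are made up for this example.

```c
#include <stdint.h>

/* Top bit of each of the four 16-bit lanes (illustrative name). */
#define SIGN_BITS16 0x8000800080008000ULL

/* Packed 16-bit wrapping subtract, four lanes in one uint64_t. */
static inline uint64_t psubw4(uint64_t m1, uint64_t m2) {
    /* Force every lane's top bit to 1 in m1 and to 0 in m2, so a
       borrow out of the low 15 bits is absorbed inside the lane. */
    uint64_t diff = (m1 | SIGN_BITS16) - (m2 & ~SIGN_BITS16);
    /* After the subtraction the top bit holds NOT(borrow); xor-ing it
       with m1 ^ ~m2 leaves m1 ^ m2 ^ borrow, the true top bit. */
    return diff ^ ((m1 ^ ~m2) & SIGN_BITS16);
}
```

For instance, `psubw4(0, 1)` yields `0xFFFF`: the low word wraps around and the other three words stay zero.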

_________________
My x86 Instruction Reference -- includes SSE, SSE2, SSE3, SSSE3, SSE4 instructions.
Assembly Programmer's Journal

Last edited by mattst88 on 20 Jun 2008, 17:17; edited 1 time in total
14 May 2008, 05:11
LocoDelAssembly

Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
14 May 2008, 16:55
bitRAKE

Joined: 21 Jul 2003
Posts: 3067
Location: vpcmipstrm
bitRAKE
Have you seen HACKMEM?
Fun read for any programmer, imho.

15 May 2008, 02:31
mattst88

Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88
Code:
```__paddw(uint64_t m1, uint64_t m2) {
uint64_t signs = (m1 ^ m2) & 0x8000800080008000;
m1 &= ~signs;
m2 &= ~signs;
m1 += m2;
m1 ^= signs;
return m1;
}    ```

Sure enough, this does seem to work. This function is the equivalent of paddw, even on systems without MMX.

Just one question, can anyone explain the logic to me?

I'm guessing here, but I think it figures out what the resulting words' signs will be, saves them, erases them from the operands, adds the operands, and restores the signs.

17 May 2008, 18:51
bitRAKE

Joined: 21 Jul 2003
Posts: 3067
Location: vpcmipstrm
bitRAKE
mattst88 wrote:
Just one question, can anyone explain the logic to me?
m1 &= ~signs;
m2 &= ~signs;

These clear the sign bits only in words where the sign bits of m1 and m2 differ. If both sign bits are set, neither is cleared, so the addition carries out of bit 15 and corrupts the next higher word (i.e. it doesn't work exactly).

Using SpeQ I wrote a script to test values:
Code:
```m1=Hex(0xFFF0)
m1 = 0xFFF0
m2=Hex(0xFFF7)
m2 = 0xFFF7
signs=Hex((m1 || m2) & 0x8000)
signs = 0x0
m1=Hex(m1 & (signs||0xFFFF))
m1 = 0xFFF0
m2=Hex(m2 & (signs||0xFFFF))
m2 = 0xFFF7
m1=Hex((m1+m2)||signs)
m1 = 0x1FFE7    ```
I don't like it and would rather just program in x86:
Code:
```; match other code names
m1 equ eax
m2 equ edx
signs equ ecx

; some test values
mov m1,$FFF0FFF0
mov m2,$FFF7FFF7

; signs = (m1 ^ m2) & 0x8000800080008000;
mov signs,m1
xor signs,m2
and signs,$80008000
; m1 &= ~signs;
not signs
and m1,signs
; m2 &= ~signs;
and m2,signs
; m1 += m2;
add m1,m2
; m1 ^= signs;
not signs ; reverse above NOT
xor m1,signs    ```
...the high and low word are different.

This works:
Code:
```__paddw(uint64_t m1, uint64_t m2) {
uint64_t d1 = m2 & 0x0000FFFF0000FFFF;
uint64_t d2 = m1 & 0xFFFF0000FFFF0000;
m1 = m1 & 0x0000FFFF0000FFFF;
m2 = m2 & 0xFFFF0000FFFF0000;
m1 += d1;
m2 += d2;

m1 = m1 & 0x0000FFFF0000FFFF;
m2 = m2 & 0xFFFF0000FFFF0000;
return m1+m2;
}    ```

18 May 2008, 02:59
LocoDelAssembly

Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
mattst88, from the link I posted check the version previous to bitRAKE's suggestion, maybe that one is clearer to see how and why it works. Adder (maybe helpful to understand the rounding part).

BTW, should I move this thread to the Heap or another subforum?
18 May 2008, 17:27
mattst88

Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88
LocoDelAssembly wrote:
mattst88, from the link I posted check the version previous to bitRAKE's suggestion, maybe that one is clearer to see how and why it works. Adder (maybe helpful to understand the rounding part).

Thanks for the code.

Quote:
BTW, should I move this thread to the Heap or another subforum?

I didn't post it there because the Heap has become where all the bs philosophy and clueless physics topics are posted.

18 May 2008, 17:32
LocoDelAssembly

Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
OK, moved to Main instead but please try to not keep it in a complete Cish way because it will need to be moved again then.
18 May 2008, 17:51
mattst88

Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88
LocoDelAssembly wrote:
OK, moved to Main instead but please try to not keep it in a complete Cish way because it will need to be moved again then.

Ehh.. what?

18 May 2008, 19:03
LocoDelAssembly

Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
mattst88 wrote:
LocoDelAssembly wrote:
OK, moved to Main instead but please try to not keep it in a complete Cish way because it will need to be moved again then.

Ehh.. what?

I wonder the same

I wanted to say that since this subforum is for general ASSEMBLY programming stuff, it would be best to provide Assembly code that only uses non-SIMD instructions and keep HLL contributions to a minimum. No problem if the HLL code is paired with its Assembly counterpart, though.
18 May 2008, 21:01
mattst88

Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88
If I'd posted assembly, it would have been for alpha, and therefore useless for the vast majority of this user group...
18 May 2008, 21:17
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 17884
revolution
mattst88 wrote:
If I'd posted assembly, it would have been for alpha, and therefore useless for the vast majority of this user group...
Hehe, if you had posted C, it would have been for HLL, and therefore useless for the vast majority of this user group..
19 May 2008, 02:56
mattst88

Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88
I talked with the guy who wrote the original code and pointed out the overflow bug. He suggested changing one line, which I believe fixes the issue.

Code:
```__paddw(uint64_t m1, uint64_t m2) {
uint64_t signs = (m1 ^ m2) & 0x8000800080008000L;
m1 &= ~0x8000800080008000L;
m2 &= ~signs;
m1 += m2;
m1 ^= signs;
return m1;
}    ```

See if you can find any bugs, or otherwise confirm it is working
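
One way to look for bugs is to compare the revised function against a plain word-by-word reference. The sketch below sets that up; `paddw_revised` is the code above under a non-reserved name, and `paddw_ref` is a hypothetical reference written only for this check.

```c
#include <stdint.h>

/* The revised version from this post, renamed. */
static inline uint64_t paddw_revised(uint64_t m1, uint64_t m2) {
    uint64_t signs = (m1 ^ m2) & 0x8000800080008000ULL;
    m1 &= ~0x8000800080008000ULL;
    m2 &= ~signs;
    m1 += m2;
    m1 ^= signs;
    return m1;
}

/* Hypothetical reference: add each 16-bit word separately, wrapping
   within the word. */
static inline uint64_t paddw_ref(uint64_t m1, uint64_t m2) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i += 16) {
        uint64_t word = ((m1 >> i) + (m2 >> i)) & 0xFFFF;
        r |= word << i;
    }
    return r;
}
```

With these definitions, `paddw_revised(0xFFFF, 0xFFFF)` evaluates to `0x17FFE` while `paddw_ref(0xFFFF, 0xFFFF)` gives `0xFFFE`. When both operands have the sign bit set in the same word, m2 keeps its sign bit, so the sum can still carry into the next word.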
14 Jun 2008, 21:43
vid
Verbosity in development

Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid
As long as we discuss algorithm or general ideas, I think posting C code is okay. Things are sometimes bit more readable that way.
15 Jun 2008, 08:27
AlexP

Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP
Quote:
Have you seen HACKMEM?
I love how their articles use giant pictures out of ASCII characters to make math symbols, like a root or sigma! That's a very nice reference.
19 Jun 2008, 02:40
edfed

Joined: 20 Feb 2006
Posts: 4246
Location: 2018
edfed
Code:
```Single Instruction, Multiple Datas.
```

Vs
Code:
```Reduced Instruction Set Cpu.
```

they are not the same thing.

if you try to implement it in a risc way, then, it is not SIMD.
it is possible to do Single Function Multiple Datas in RISC way.
but not SIMD in RISC.

OKAY?

only boolean instructions can be for MD.

OP1 and OP2
OP1 or OP2
OP1 xor OP2
not OP1

else are math instructions and are like IIR filter, each elements (bits) depends on the state of its neighbourhood.
OP1 + OP2
OP1 - OP2
neg OP1
OP1 * OP2
OP1 / OP2

these instructions cannot be SIMDed in the RISC way.

speaking about SIMD in RISC like it:
Code:
```mov signs,m1
xor signs,m2
and signs,$80008000
```

is a lack of culture and an error.

words exist for what they mean.

sorry for my boring attitude, but it is very important to not change the meaning of some words and definitions
19 Jun 2008, 10:48
mattst88

Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88
edfed wrote:
Code:
```Single Instruction, Multiple Datas.
```

Vs
Code:
```Reduced Instruction Set Cpu.
```

they are not the same thing.

I know these are different, but thanks for clarifying ...?

edfed wrote:
if you try to implement it in a risc way, then, it is not SIMD.
it is possible to do Single Function Multiple Datas in RISC way.
but not SIMD in RISC.

That's getting kind of picky. Pedantic, even. But yes, I'm interested in implementing functions using multiple instructions to emulate an MMX/SSE style SIMD instruction for a performance benefit.

edfed wrote:
OKAY?

OKAY!

edfed wrote:
... is a lack of culture and an error.

I beg to differ. I think I'm quite cultured. In fact, my 10th grade English teacher told me so when I played The Ballad of John and Yoko for my project on poetry. Aside: For fuck's sake, it's a Beatles song. It hit #1 on the charts. It's not some obscure song or genre that I'm introducing the world to.

edfed wrote:
words exist for what they mean.

Would changing the topic to one about SFMD (single function, multiple data) appease you?

20 Jun 2008, 15:08
edfed

Joined: 20 Feb 2006
Posts: 4246
Location: 2018
edfed
Quote:
Would changing the topic to one about SFMD (single function, multiple data) appease you?

yes, and sorry for the fun.

20 Jun 2008, 16:03
baldr

Joined: 19 Mar 2008
Posts: 1651
baldr
The general idea is to prevent a carry from the MSB of one [packed] word into the LSB of the adjacent one (I mean, from bit 15 to 16, bit 31 to 32 and bit 47 to 48). So the code

1) adds the MSBs without carry;
Code:
`signs = (m1 ^ m2) & 0x8000800080008000L;    `

2) clears the MSBs of both operands;
Code:
```m1 &= 0x7FFF7FFF7FFF7FFFL;
m2 &= 0x7FFF7FFF7FFF7FFFL;    ```

3) adds the rest bits with carry;
Code:
`m1 += m2;    `

4) adds (1) to (3) without carry;
Code:
`m1 ^= signs;    `

Seems to me, everything is correct.
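
The steps in this post can be put together as one function. This is a sketch under the assumption that both operands are masked with 0x7FFF7FFF7FFF7FFF, as in the snippet shown here (the versions posted earlier in the thread mask differently); the name `paddw_masked` is made up.

```c
#include <stdint.h>

/* Packed 16-bit wrapping add, four words in one uint64_t. */
static inline uint64_t paddw_masked(uint64_t m1, uint64_t m2) {
    const uint64_t msb = 0x8000800080008000ULL;
    uint64_t signs = (m1 ^ m2) & msb; /* 1) add the MSBs without carry  */
    m1 &= ~msb;                       /* 2) clear the MSBs of both      */
    m2 &= ~msb;                       /*    operands                    */
    m1 += m2;                         /* 3) add the rest with carry;    */
                                      /*    it stops at each bit 15     */
    m1 ^= signs;                      /* 4) add (1) to (3) without carry */
    return m1;
}
```

With the MSBs cleared in both operands, the sum of two 15-bit values can carry at most into bit 15 of its own word, so for example `paddw_masked(0xFFFF, 0xFFFF)` gives `0xFFFE` and `paddw_masked(0xFFF0FFF0, 0xFFF7FFF7)` gives `0xFFE7FFE7`.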
26 Jul 2008, 19:15
MCD

Joined: 21 Aug 2004
Posts: 604
Location: Germany
MCD
bitRAKE wrote:

This works:
Code:
```__paddw(uint64_t m1, uint64_t m2) {
uint64_t d1 = m2 & 0x0000FFFF0000FFFF;
uint64_t d2 = m1 & 0xFFFF0000FFFF0000;
m1 = m1 & 0x0000FFFF0000FFFF;
m2 = m2 & 0xFFFF0000FFFF0000;
m1 += d1;
m2 += d2;

m1 = m1 & 0x0000FFFF0000FFFF;
m2 = m2 & 0xFFFF0000FFFF0000;
return m1+m2;
}    ```

I can confirm this

both
mattst88 wrote:

Code:
```__paddw(uint64_t m1, uint64_t m2) {
uint64_t signs = (m1 ^ m2) & 0x8000800080008000;
m1 &= ~signs;
m2 &= ~signs;
m1 += m2;
m1 ^= signs;
return m1;
}    ```

Code:
```__paddw(uint64_t m1, uint64_t m2) {
uint64_t signs = (m1 ^ m2) & 0x8000800080008000L;
m1 &= ~0x8000800080008000L;
m2 &= ~signs;
m1 += m2;
m1 ^= signs;
return m1;
}    ```

don't seem to work for me when I set m1 = 0x0000 0000 0000 0003 and m2 = 0x0000 0000 0000 8002, for example.
The result should be 0x0000 0000 0000 8005, but I'm getting 0x0000 0000 0000 0000.

Can anyone confirm this, or are my computations erroneous?
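
Working through that input by hand, both versions appear to produce the expected 0x8005, so the zero result may come from the test setup rather than from the algorithm. A small sketch to reproduce the check follows; `paddw_v1` and `paddw_v2` are hypothetical renames of the two versions quoted above.

```c
#include <stdint.h>

/* First version quoted above: both operands masked with ~signs. */
static inline uint64_t paddw_v1(uint64_t m1, uint64_t m2) {
    uint64_t signs = (m1 ^ m2) & 0x8000800080008000ULL;
    m1 &= ~signs;
    m2 &= ~signs;
    m1 += m2;
    m1 ^= signs;
    return m1;
}

/* Second version quoted above: m1 masked with the full sign mask. */
static inline uint64_t paddw_v2(uint64_t m1, uint64_t m2) {
    uint64_t signs = (m1 ^ m2) & 0x8000800080008000ULL;
    m1 &= ~0x8000800080008000ULL;
    m2 &= ~signs;
    m1 += m2;
    m1 ^= signs;
    return m1;
}
```

For m1 = 0x0003 and m2 = 0x8002, signs is 0x8000, the masked operands are 3 and 2, and the final xor gives 0x8005 in both functions. The inputs that do break them have the sign bit set in the same word of both operands: `paddw_v1(0xFFFF, 0xFFFF)` gives `0x1FFFE` and `paddw_v2(0xFFFF, 0xFFFF)` gives `0x17FFE`, where 0xFFFE is expected.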
13 Aug 2008, 13:41