flat assembler
Message board for the users of flat assembler.

What to do with new instructions?

aq83326
Joined: 25 Jun 2011
Posts: 21
Posted: 05 Jan 2012, 06:00

Okay, SSE extensions aren't new, but I learned assembly a long time ago.
They have new instructions that go way beyond what I know by heart.
It's like learning a new programming language.

Where can I find some snippets of people doing really cool sh** with SSE instructions? (Feel free to post yours here)
I don't mean just for performance's sake, but how these things get used and what they are good for: old algorithms done in an interesting way with these new instructions, stuff like that.

I have downloaded the manuals from Intel, plus another one which is good but a lot to take in:
Intel® SSE4 Programming Reference:
http://edc.intel.com/Link.aspx?id=1630

cod3b453
Joined: 25 Aug 2004
Posts: 618
Posted: 05 Jan 2012, 18:53
The main difference is that SSE is a SIMD instruction set, which performs the same action on multiple (packed) values at once. The majority of the SSE2/3 set operates on 4x 32-bit or 2x 64-bit values per register, but it includes some 8-bit and 16-bit operations as well, for both integers and reals. I'm not really familiar enough with the later versions, but afaik they're more specialised towards codecs.
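
For example, a single SSE2 instruction can add four pairs of packed 32-bit integers at once. A minimal sketch (the vec_a/vec_b/vec_result labels are just illustrative data, not from any real code):
Code:
        movdqa xmm0,dqword [vec_a]      ; xmm0 = a3 a2 a1 a0 (four 32-bit values)
        paddd xmm0,dqword [vec_b]       ; xmm0 = a3+b3 a2+b2 a1+b1 a0+b0, one instruction
        movdqa dqword [vec_result],xmm0 ; store all four sums with one access

        ; data, 16-byte aligned for movdqa/paddd
        align 16
vec_a           dd 1,2,3,4
vec_b           dd 10,20,30,40
vec_result      dd 4 dup (?)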

In general, these instructions are useful for computationally intensive operations such as vector/matrix arithmetic, more complex mathematical functions, or even just fast memory copying.
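
As a quick illustration of the memory copying case, a copy loop can move 16 bytes per iteration (a minimal sketch, assuming both buffers are 16-byte aligned and the byte count in rcx is a multiple of 16; the register choice is illustrative):
Code:
        ; rsi = source, rdi = destination, rcx = byte count (multiple of 16)
        xor rax,rax
.copy:
        movdqa xmm0,dqword [rsi+rax]    ; load 16 bytes in one access
        movdqa dqword [rdi+rax],xmm0    ; store 16 bytes in one access
        add rax,16
        cmp rax,rcx
        jb .copy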

Below is an example from my OS's software graphics driver for alpha blending 4 pixels per loop iteration using fixed-point integer operations:
Code:
        pxor xmm0,xmm0
        movdqa xmm10,dqword [const_00010101000101010001010100010101]
        movdqa xmm11,dqword [const_00000001000100010000000100010001]
        movdqa xmm12,dqword [const_FF00FF00FF00FF00FF00FF00FF00FF00]
        movdqa xmm13,dqword [const_00FFFFFF00FFFFFF00FFFFFF00FFFFFF]
        movdqa xmm14,dqword [const_0000000000000000FFFFFFFFFFFFFFFF]
        movdqa xmm15,dqword [const_FFFFFFFFFFFFFFFF0000000000000000]

        mov rax,16                      ; 16 bytes (4 pixels) per iteration
        xor rcx,rcx                     ; rcx = byte offset into the buffers
        mov edx,dword [buffer_size]     ; zero-extends into rdx; assumes buffer_size is in bytes
        mov rsi,qword [plSrc]
        mov rdi,qword [plDst]

                                        ; F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
        mov rsi,[rsi+LAYER.pBuffer]     ; A3 R3 G3 B3 A2 R2 G2 B2 A1 R1 G1 B1 A0 R0 G0 B0
        mov rdi,[rdi+LAYER.pBuffer]     ; A7 R7 G7 B7 A6 R6 G6 B6 A5 R5 G5 B5 A4 R4 G4 B4

        align 4

.loop:
        movdqa xmm1,dqword [rsi+rcx]    ; xmm1 = A3 R3 G3 B3 A2 R2 G2 B2 A1 R1 G1 B1 A0 R0 G0 B0
        movdqa xmm2,xmm1                ; xmm2 = A3 R3 G3 B3 A2 R2 G2 B2 A1 R1 G1 B1 A0 R0 G0 B0

        punpcklbw xmm1,xmm0             ; xmm1 = 00 A1 00 R1 00 G1 00 B1 00 A0 00 R0 00 G0 00 B0
        punpckhbw xmm2,xmm0             ; xmm2 = 00 A3 00 R3 00 G3 00 B3 00 A2 00 R2 00 G2 00 B2

        pshuflw xmm3,xmm1,const_3_3_3_3 ; xmm3 = 00 A1 00 R1 00 G1 00 B1 00 A0 00 A0 00 A0 00 A0
        pand xmm3,xmm14                 ; xmm3 = 00 00 00 00 00 00 00 00 00 A0 00 A0 00 A0 00 A0
        pshufhw xmm8,xmm1,const_3_3_3_3 ; xmm8 = 00 A1 00 A1 00 A1 00 A1 00 A0 00 R0 00 G0 00 B0
        pand xmm8,xmm15                 ; xmm8 = 00 A1 00 A1 00 A1 00 A1 00 00 00 00 00 00 00 00
        por xmm3,xmm8                   ; xmm3 = 00 A1 00 A1 00 A1 00 A1 00 A0 00 A0 00 A0 00 A0

        pshuflw xmm4,xmm2,const_3_3_3_3 ; xmm4 = 00 A3 00 R3 00 G3 00 B3 00 A2 00 A2 00 A2 00 A2
        pand xmm4,xmm14                 ; xmm4 = 00 00 00 00 00 00 00 00 00 A2 00 A2 00 A2 00 A2
        pshufhw xmm8,xmm2,const_3_3_3_3 ; xmm8 = 00 A3 00 A3 00 A3 00 A3 00 A2 00 R2 00 G2 00 B2
        pand xmm8,xmm15                 ; xmm8 = 00 A3 00 A3 00 A3 00 A3 00 00 00 00 00 00 00 00
        por xmm4,xmm8                   ; xmm4 = 00 A3 00 A3 00 A3 00 A3 00 A2 00 A2 00 A2 00 A2

        psllw xmm3,0x8                  ; xmm3 = A1 00 A1 00 A1 00 A1 00 A0 00 A0 00 A0 00 A0 00
        psllw xmm4,0x8                  ; xmm4 = A3 00 A3 00 A3 00 A3 00 A2 00 A2 00 A2 00 A2 00

        movdqa xmm5,xmm3                ; xmm5 = A1 00 A1 00 A1 00 A1 00 A0 00 A0 00 A0 00 A0 00
        movdqa xmm6,xmm4                ; xmm6 = A3 00 A3 00 A3 00 A3 00 A2 00 A2 00 A2 00 A2 00

        pxor xmm5,xmm12                 ; xmm5 = V1 00 V1 00 V1 00 V1 00 V0 00 V0 00 V0 00 V0 00 (V = 255-A)
        pxor xmm6,xmm12                 ; xmm6 = V3 00 V3 00 V3 00 V3 00 V2 00 V2 00 V2 00 V2 00

        movdqa xmm8,dqword [rdi+rcx]    ; xmm8 = 00 R7 G7 B7 00 R6 G6 B6 00 R5 G5 B5 00 R4 G4 B4
        movdqa xmm9,xmm8                ; xmm9 = 00 R7 G7 B7 00 R6 G6 B6 00 R5 G5 B5 00 R4 G4 B4

        punpcklbw xmm8,xmm0             ; xmm8 = 00 00 00 R5 00 G5 00 B5 00 00 00 R4 00 G4 00 B4
        paddusw xmm8,xmm11              ; +1 rounding on each colour word
        punpckhbw xmm9,xmm0             ; xmm9 = 00 00 00 R7 00 G7 00 B7 00 00 00 R6 00 G6 00 B6
        paddusw xmm9,xmm11              ; +1 rounding on each colour word

        pmulhuw xmm1,xmm3               ; xmm1 = 00 II 00 X1 00 X1 00 X1 00 II 00 X0 00 X0 00 X0
        pmulhuw xmm2,xmm4               ; xmm2 = 00 II 00 X3 00 X3 00 X3 00 II 00 X2 00 X2 00 X2
        pmulhuw xmm8,xmm5               ; xmm8 = 00 II 00 Y1 00 Y1 00 Y1 00 II 00 Y0 00 Y0 00 Y0
        pmulhuw xmm9,xmm6               ; xmm9 = 00 II 00 Y3 00 Y3 00 Y3 00 II 00 Y2 00 Y2 00 Y2

        packuswb xmm1,xmm2              ; xmm1 = II X3 X3 X3 II X2 X2 X2 II X1 X1 X1 II X0 X0 X0
        packuswb xmm8,xmm9              ; xmm8 = II Y3 Y3 Y3 II Y2 Y2 Y2 II Y1 Y1 Y1 II Y0 Y0 Y0

        paddusb xmm1,xmm8               ; xmm1 = II Z3 Z3 Z3 II Z2 Z2 Z2 II Z1 Z1 Z1 II Z0 Z0 Z0
        paddusb xmm1,xmm10              ; +1 rounding on each colour byte

        pand xmm1,xmm13                 ; xmm1 = 00 Z3 Z3 Z3 00 Z2 Z2 Z2 00 Z1 Z1 Z1 00 Z0 Z0 Z0

        movdqa dqword [rdi+rcx],xmm1    ; write back 4 blended pixels

        add rcx,rax
        cmp rcx,rdx
        jb .loop
This is not fully optimised, but it shows how much can be done with 2 loads, 1 store, 4 multiplies and some addition/packing; with GPRs this would be something like 24 loads, 12 stores and 24 multiplies, though less packing.

The key points are that memory accesses are reduced by performing fewer, larger accesses, and that the computational path is dramatically shorter because the operations are not interdependent and do not share a resource as often (in this case the multiplier), so they can occur simultaneously.
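
For a sense of scale, blending just one 8-bit channel of one pixel with GPRs, using the same divide-by-256 fixed-point approximation the loop above uses, looks something like this (a rough sketch; the register assignments are illustrative):
Code:
        ; result ~ (src*alpha + dst*(256-alpha)) / 256
        ; eax = source channel, ebx = destination channel, ecx = alpha (0..255)
        mov edx,256
        sub edx,ecx             ; edx = 256 - alpha
        imul eax,ecx            ; eax = src * alpha
        imul ebx,edx            ; ebx = dst * (256 - alpha)
        add eax,ebx             ; eax = src*alpha + dst*(256-alpha)
        shr eax,8               ; eax = blended channel
Repeating that for three colour channels of four pixels is where a figure like 24 multiplies comes from.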
