flat assembler - 32bit rgba pixel - 16bytes to 4 floats

jmcclane

Joined: 17 Feb 2013
Posts: 14

jmcclane 09 Jul 2013, 15:45

Could someone please help me speed this up?
I can't find any tutorial... does anyone know of one?
I want to convert 16 bytes of RGBA pixels (4 pixels) into 4 vectors of 4 floats...
Thanks!

proc Rgbato4Floats vD:dword, vA:dword

mov eax, [vA]
mov ecx, [vD]

pxor xmm7, xmm7

movdqu xmm0, [eax] ; a3,b3,g3,r3 a2,b2,g2,r2 a1,b1,g1,r1 a0,b0,g0,r0 ; 16 bytes

movdqa xmm2, xmm0
punpcklbw xmm0, xmm7
punpckhbw xmm2, xmm7

movdqa xmm1, xmm0
punpcklwd xmm0, xmm7
cvtdq2ps xmm0, xmm0 ; a0, b0, g0, r0 ; 4 floats
punpckhwd xmm1, xmm7
cvtdq2ps xmm1, xmm1 ; a1, b1, g1, r1 ; 4 floats

movdqa xmm3, xmm2
punpcklwd xmm2, xmm7
cvtdq2ps xmm2, xmm2 ; a2, b2, g2, r2 ; 4 floats
punpckhwd xmm3, xmm7
cvtdq2ps xmm3, xmm3 ; a3, b3, g3, r3 ; 4 floats

movups [ecx+00], xmm0
movups [ecx+16], xmm1
movups [ecx+32], xmm2
movups [ecx+48], xmm3

ret
endp

cod3b453

Joined: 25 Aug 2004
Posts: 618

cod3b453 09 Jul 2013, 16:34

If this is getting called a lot, it may help to move the loop into this procedure. Any instructions that might do better would require the more recent SSSE3+ (pshufb) or AVX (vcvtdq2ps) instruction sets; are these available? It may also be worth enforcing aligned buffers so you can use movdqa/movaps for the loads and stores, and reordering the instructions to reduce interdependencies between the punpckxxx/cvtdq2ps stages.
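
As a rough illustration of the "loop inside the procedure, aligned buffers" idea, here is a minimal sketch written with C SSE2 intrinsics rather than fasm; each intrinsic is annotated with the instruction it corresponds to. The function name `rgba_to_floats` and its parameters are hypothetical, and the sketch assumes the pixel count is a multiple of 4 and that both buffers are 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts `count` RGBA pixels (4 bytes each) into
   4 floats per pixel. Assumes `count` is a multiple of 4 and that `src`
   and `dst` are both 16-byte aligned. */
void rgba_to_floats(float *dst, const uint8_t *src, size_t count)
{
    const __m128i zero = _mm_setzero_si128();                 /* pxor */

    for (size_t i = 0; i < count; i += 4) {
        /* Aligned load of 4 pixels = 16 bytes (movdqa). */
        __m128i px = _mm_load_si128((const __m128i *)(src + 4 * i));

        /* Bytes -> words (punpcklbw / punpckhbw against zero). */
        __m128i lo = _mm_unpacklo_epi8(px, zero);
        __m128i hi = _mm_unpackhi_epi8(px, zero);

        /* Words -> dwords (punpcklwd / punpckhwd), then cvtdq2ps. */
        __m128 p0 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(lo, zero));
        __m128 p1 = _mm_cvtepi32_ps(_mm_unpackhi_epi16(lo, zero));
        __m128 p2 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(hi, zero));
        __m128 p3 = _mm_cvtepi32_ps(_mm_unpackhi_epi16(hi, zero));

        /* Aligned stores, one pixel (4 floats) each (movaps). */
        _mm_store_ps(dst + 4 * i + 0,  p0);
        _mm_store_ps(dst + 4 * i + 4,  p1);
        _mm_store_ps(dst + 4 * i + 8,  p2);
        _mm_store_ps(dst + 4 * i + 12, p3);
    }
}
```

Doing the four unpacks before the four conversions is one way to loosen the dependency chain mentioned above; whether it helps on a given CPU would have to be measured.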

jmcclane 09 Jul 2013, 16:45

Thanks... but I want to do it without SSSE3+.

tthsqe

Joined: 20 May 2009
Posts: 773

tthsqe 09 Jul 2013, 18:01

I couldn't get anything faster using <= sse2. It of course helps to have things aligned so that you can use aligned mov's.
Considering that the optimal sse4.1 solution

Code:

irps i, 0 1 2 3 {
pmovzxbd  xmm#i,[eax+4*i]
cvtdq2ps  xmm#i,xmm#i
movaps    [ecx+16*i],xmm#i
}

is only about 20% faster on my machine, I would say that you have hit the throughput limit of the sse instructions. If you need more speed and this function is the bottleneck, I would say get rid of the byte representation in your program and just stick with the 4x larger unpacked float representation.

jmcclane 09 Jul 2013, 18:40

Thanks a lot... I have an old CPU that doesn't support SSSE3+ or SSE4,
so I use SSE2 or SSE3.
I can't test it :(

Does anyone know how to convert back to 16 bytes?

tthsqe 09 Jul 2013, 18:54

Your solution for 4x byte -> 4x float is fairly good if you stick to sse2. :)

However, may I ask why your program needs a fast byte -> float function?

jmcclane 09 Jul 2013, 19:20

I'm learning... later I want to use it for somewhat faster effects
on images... I use VB.NET and fasm...
it's a hobby that I love...

tthsqe 10 Jul 2013, 13:52

jmcclane,
to invert what you are doing, all you need to do is reverse the instruction order and invert each operation.

Code:

cvtdq2ps -> cvtps2dq
punpcklwd+punpckhwd -> packssdw (or packusdw sse4.1)
punpcklbw+punpckhbw -> packuswb (packsswb does not work!)

If you assume that your input floats round to an integer in the interval [0,255], you will need to use the unsigned pack for the word->byte step (the signed pack would clamp everything above 127), but you don't have to worry about the sign for the dword->word conversion. So it all happily fits in sse2.
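
To make the pack sequence concrete, here is a minimal sketch using C SSE2 intrinsics instead of fasm, with the corresponding instruction named in each comment. The helper name `floats_to_rgba` is hypothetical, and unaligned loads and stores are used only to keep the example simple.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: packs 16 floats (4 RGBA pixels) back into 16 bytes.
   Values are rounded using the current MXCSR rounding mode (round to
   nearest by default) and saturated to [0,255] by the final pack. */
void floats_to_rgba(uint8_t *dst, const float *src)
{
    /* Float -> signed dword (cvtps2dq). */
    __m128i d0 = _mm_cvtps_epi32(_mm_loadu_ps(src + 0));
    __m128i d1 = _mm_cvtps_epi32(_mm_loadu_ps(src + 4));
    __m128i d2 = _mm_cvtps_epi32(_mm_loadu_ps(src + 8));
    __m128i d3 = _mm_cvtps_epi32(_mm_loadu_ps(src + 12));

    /* Dword -> word with signed saturation (packssdw).
       Values in [0,255] pass through unchanged. */
    __m128i w01 = _mm_packs_epi32(d0, d1);
    __m128i w23 = _mm_packs_epi32(d2, d3);

    /* Word -> byte with unsigned saturation (packuswb).
       packsswb would clamp at 127 and ruin anything brighter. */
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(w01, w23));
}
```

With this sketch, out-of-range inputs are clamped rather than wrapped: 300.0 becomes 255 and -5.0 becomes 0.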

Also, as I said before, it is probably faster to work with floats in your image algorithms, so you should follow:

Code:

1. convert bytes in .bmp to internal float representation
2. do all internal computations with floats
3. convert floats back to bytes in .bmp
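
The three steps can be sketched as follows, again with C SSE2 intrinsics rather than fasm. The function `scale_rgba` and its `gain` parameter are made up for illustration, and a single multiply stands in for step 2; in a real program the floats would stay unpacked across many operations, as suggested above. The sketch assumes the pixel count is a multiple of 4 and, for simplicity, scales the alpha channel too.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: multiplies every channel of `count` RGBA pixels
   by `gain`, in place. Assumes `count` is a multiple of 4. */
void scale_rgba(uint8_t *pixels, size_t count, float gain)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128  k    = _mm_set1_ps(gain);

    for (size_t i = 0; i < count; i += 4) {
        __m128i *p = (__m128i *)(pixels + 4 * i);

        /* 1. bytes -> floats (punpck* + cvtdq2ps) */
        __m128i px = _mm_loadu_si128(p);
        __m128i lo = _mm_unpacklo_epi8(px, zero);
        __m128i hi = _mm_unpackhi_epi8(px, zero);
        __m128 f0 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(lo, zero));
        __m128 f1 = _mm_cvtepi32_ps(_mm_unpackhi_epi16(lo, zero));
        __m128 f2 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(hi, zero));
        __m128 f3 = _mm_cvtepi32_ps(_mm_unpackhi_epi16(hi, zero));

        /* 2. computation in floats (mulps) */
        f0 = _mm_mul_ps(f0, k);
        f1 = _mm_mul_ps(f1, k);
        f2 = _mm_mul_ps(f2, k);
        f3 = _mm_mul_ps(f3, k);

        /* 3. floats -> bytes (cvtps2dq + packssdw + packuswb) */
        __m128i w01 = _mm_packs_epi32(_mm_cvtps_epi32(f0), _mm_cvtps_epi32(f1));
        __m128i w23 = _mm_packs_epi32(_mm_cvtps_epi32(f2), _mm_cvtps_epi32(f3));
        _mm_storeu_si128(p, _mm_packus_epi16(w01, w23));
    }
}
```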

jmcclane 13 Jul 2013, 13:53

From the code above...

....
xmm0 ; a0, b0, g0, r0 ; 4 floats
xmm1 ; a1, b1, g1, r1 ; 4 floats
xmm2 ; a2, b2, g2, r2 ; 4 floats
xmm3 ; a3, b3, g3, r3 ; 4 floats
.....

cvtps2dq xmm3, xmm3
cvtps2dq xmm2, xmm2
cvtps2dq xmm1, xmm1
cvtps2dq xmm0, xmm0

packssdw xmm2, xmm3
packssdw xmm0, xmm1

packuswb xmm0, xmm2 ; xmm0 = a3,b3,g3,r3 a2,b2,g2,r2 a1,b1,g1,r1 a0,b0,g0,r0 ; 16 bytes

I tried it and it works great...
Thank you very much for your help, tthsqe!

Next I will try to transpose the colors (as a matrix)... starting from the code above, to get:

....
movdqa xmm1, xmm0
punpcklwd xmm0, xmm7
cvtdq2ps xmm0, xmm0 ; a0, b0, g0, r0 ; 4 floats -> xmm0 = a0; a1; a2; a3
punpckhwd xmm1, xmm7
cvtdq2ps xmm1, xmm1 ; a1, b1, g1, r1 ; 4 floats -> xmm1 = b0; b1; b2; b3

movdqa xmm3, xmm2
punpcklwd xmm2, xmm7
cvtdq2ps xmm2, xmm2 ; a2, b2, g2, r2 ; 4 floats -> xmm2 = g0; g1; g2; g3
punpckhwd xmm3, xmm7
cvtdq2ps xmm3, xmm3 ; a3, b3, g3, r3 ; 4 floats -> xmm3 = r0; r1; r2; r3

...so any help is welcome

cod3b453 14 Jul 2013, 10:35

The following will load the 16 bytes from xmm0 to xmm0-3 as transposed 32bit floats:

Code:

                                ; xmm0 = A3 B3 G3 R3 A2 B2 G2 R2 A1 B1 G1 R1 A0 B0 G0 R0
        movdqa xmm1,dqword [mask_000000FF000000FF000000FF000000FF]
                                ; xmm1 = 00 00 00 FF 00 00 00 FF 00 00 00 FF 00 00 00 FF

        movdqa xmm3,xmm0        ; xmm3 = A3 B3 G3 R3 A2 B2 G2 R2 A1 B1 G1 R1 A0 B0 G0 R0
        psrld xmm0,8            ; xmm0 = 00 A3 B3 G3 00 A2 B2 G2 00 A1 B1 G1 00 A0 B0 G0
        pand xmm3,xmm1          ; xmm3 = 00 00 00 R3 00 00 00 R2 00 00 00 R1 00 00 00 R0

        movdqa xmm2,xmm0        ; xmm2 = 00 A3 B3 G3 00 A2 B2 G2 00 A1 B1 G1 00 A0 B0 G0
        psrld xmm0,8            ; xmm0 = 00 00 A3 B3 00 00 A2 B2 00 00 A1 B1 00 00 A0 B0
        pand xmm2,xmm1          ; xmm2 = 00 00 00 G3 00 00 00 G2 00 00 00 G1 00 00 00 G0

        pand xmm1,xmm0          ; xmm1 = 00 00 00 B3 00 00 00 B2 00 00 00 B1 00 00 00 B0
        psrld xmm0,8            ; xmm0 = 00 00 00 A3 00 00 00 A2 00 00 00 A1 00 00 00 A0

        cvtdq2ps xmm3,xmm3      ; xmm3 = R3 R2 R1 R0
        cvtdq2ps xmm2,xmm2      ; xmm2 = G3 G2 G1 G0
        cvtdq2ps xmm1,xmm1      ; xmm1 = B3 B2 B1 B0
        cvtdq2ps xmm0,xmm0      ; xmm0 = A3 A2 A1 A0

        ; ...

        align 16
mask_000000FF000000FF000000FF000000FF:
        dd 0x000000FF,0x000000FF,0x000000FF,0x000000FF

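With a little modification you can also get the inverse alpha (255 - A): keep a copy of the 0xFF mask in a spare register, since xmm1 is overwritten by the last pand, and xor it with the alpha lanes in xmm0 into, say, xmm4 before the conversion.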
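
For comparison, here is the same shift-and-mask transposition as a minimal sketch in C SSE2 intrinsics, including the inverse alpha. The helper name `rgba_to_planar` and its output parameters are hypothetical, and the sketch assumes the R, G, B, A byte order in memory shown in the comments above (R in the lowest byte of each dword).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: splits 4 RGBA pixels (16 bytes) into one float
   vector per channel, plus the inverse alpha (255 - A). */
void rgba_to_planar(const uint8_t *src,
                    float r[4], float g[4], float b[4],
                    float a[4], float inv_a[4])
{
    const __m128i mask = _mm_set1_epi32(0xFF);     /* 000000FF x 4 */
    __m128i px = _mm_loadu_si128((const __m128i *)src);

    __m128i vr = _mm_and_si128(px, mask);                        /* pand         */
    __m128i vg = _mm_and_si128(_mm_srli_epi32(px, 8), mask);     /* psrld + pand */
    __m128i vb = _mm_and_si128(_mm_srli_epi32(px, 16), mask);    /* psrld + pand */
    __m128i va = _mm_srli_epi32(px, 24);                         /* psrld        */

    /* For a value in [0,255], xor with 0xFF equals 255 - value (pxor). */
    __m128i vi = _mm_xor_si128(va, mask);

    /* Dword -> float (cvtdq2ps), one vector per channel. */
    _mm_storeu_ps(r,     _mm_cvtepi32_ps(vr));
    _mm_storeu_ps(g,     _mm_cvtepi32_ps(vg));
    _mm_storeu_ps(b,     _mm_cvtepi32_ps(vb));
    _mm_storeu_ps(a,     _mm_cvtepi32_ps(va));
    _mm_storeu_ps(inv_a, _mm_cvtepi32_ps(vi));
}
```

Unlike the assembly above, this version shifts from the original register each time instead of chaining three 8-bit shifts; both give the same lanes, and which one schedules better is something to measure.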

jmcclane 15 Jul 2013, 08:27

Thanks, cod3b453!
