flat assembler
Message board for the users of flat assembler.
32bit rgba pixel: 16 bytes to 4 floats

cod3b453
If this is getting called a lot, it may help to move the loop into this procedure. Any instructions that might be better would require more recent SSSE3+ (pshufb) / AVX (vcvtdq2ps) instruction sets to be supported - are these available? It may also be worth enforcing aligned buffers so you can use movdqa/movaps for loads/stores, and reordering the instructions to reduce interdependencies in the punpckxxx/cvtdq2ps stages.
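As a side note on the aligned-buffer suggestion: movdqa/movaps fault on addresses that are not 16-byte aligned, so the pixel buffers must be allocated on a 16-byte boundary. A minimal C sketch (the function name `alloc_pixels` is mine, not from the thread; `aligned_alloc` is C11):

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate an RGBA pixel buffer whose start address is 16-byte aligned,
   so a conversion loop can use movdqa/movaps instead of movdqu/movups.
   aligned_alloc requires the size to be a multiple of the alignment. */
static void *alloc_pixels(size_t npixels)
{
    size_t bytes = npixels * 4;         /* 4 bytes per RGBA pixel */
    bytes = (bytes + 15) & ~(size_t)15; /* round up to a multiple of 16 */
    return aligned_alloc(16, bytes);
}
```

The stride between rows should also be padded to a multiple of 16 so every row start stays aligned.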


09 Jul 2013, 16:34 

jmcclane
Thanks... but I want it without SSSE3+


09 Jul 2013, 16:45 

tthsqe
I couldn't get anything faster using <= SSE2. It of course helps to have things aligned so that you can use aligned movs.
Considering that the optimal SSE4.1 solution
Code:
irps i, 0 1 2 3
{
	pmovzxbd xmm#i,[eax+4*i]
	cvtdq2ps xmm#i,xmm#i
	movaps [ecx+16*i],xmm#i
}
is only about 20% faster on my machine, I would say that you have hit the throughput limit of the SSE instructions. If you need more speed and this function is the bottleneck, I would get rid of the byte representation in your program and just stick with the 4x larger unpacked float representation.
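For readers following along in C, the <= SSE2 approach under discussion can be sketched with intrinsics roughly as follows (the function name is mine; punpcklbw/punpckhbw map to `_mm_unpacklo_epi8`/`_mm_unpackhi_epi8`, punpcklwd/punpckhwd to the epi16 variants, and cvtdq2ps to `_mm_cvtepi32_ps`):

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Convert 16 RGBA bytes (4 pixels) to 16 floats using SSE2 only. */
static void bytes_to_floats_sse2(const uint8_t src[16], float dst[16])
{
    __m128i zero = _mm_setzero_si128();
    __m128i b  = _mm_loadu_si128((const __m128i *)src);
    __m128i w0 = _mm_unpacklo_epi8(b, zero);   /* punpcklbw: low 8 bytes -> words */
    __m128i w1 = _mm_unpackhi_epi8(b, zero);   /* punpckhbw: high 8 bytes -> words */
    __m128i d0 = _mm_unpacklo_epi16(w0, zero); /* punpcklwd: words -> dwords */
    __m128i d1 = _mm_unpackhi_epi16(w0, zero);
    __m128i d2 = _mm_unpacklo_epi16(w1, zero);
    __m128i d3 = _mm_unpackhi_epi16(w1, zero);
    _mm_storeu_ps(dst +  0, _mm_cvtepi32_ps(d0)); /* cvtdq2ps: dwords -> floats */
    _mm_storeu_ps(dst +  4, _mm_cvtepi32_ps(d1));
    _mm_storeu_ps(dst +  8, _mm_cvtepi32_ps(d2));
    _mm_storeu_ps(dst + 12, _mm_cvtepi32_ps(d3));
}
```

Zero-extending through the unpacks means no sign correction is needed anywhere in the chain.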

09 Jul 2013, 18:01 

jmcclane
Thanks a lot... I have an old CPU that doesn't support SSSE3+/SSE4,
so I use SSE2 or SSE3; I can't test the others :( Does anyone know how to convert back to 16 bytes?

09 Jul 2013, 18:40 

tthsqe
Your solution for 4x byte > 4x float is fairly good if you stick to SSE2.
However, may I ask why your program needs a fast byte > float function?

09 Jul 2013, 18:54 

jmcclane
I'm learning... later I'll use it for slightly faster effects
on images... I use VB.NET and fasm... a hobby that I love...

09 Jul 2013, 19:20 

tthsqe
jmcclane,
to invert what you are doing, all you need to do is reverse the instruction order and invert each operation.
Code:
cvtdq2ps            > cvtps2dq
punpcklwd+punpckhwd > packssdw (or packusdw, SSE4.1)
punpcklbw+punpckhbw > packuswb (packsswb does not work!)
If you assume that your input floats round to an integer in the interval [0,255], you will need to use the unsigned pack for the word > byte step, but you don't have to worry about the sign for the dword > word conversion. So it all happily fits in SSE2. Also, as I said before, it is probably faster to work with floats in your image algorithms, so you should follow:
Code:
1. convert bytes in .bmp to internal float representation
2. do all internal computations with floats
3. convert floats back to bytes in .bmp
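A C-intrinsics sketch of this reverse sequence (function name mine), using only the SSE2 combination cvtps2dq + packssdw + packuswb:

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Convert 16 floats back to 16 RGBA bytes, saturating to [0,255]. */
static void floats_to_bytes_sse2(const float src[16], uint8_t dst[16])
{
    __m128i d0 = _mm_cvtps_epi32(_mm_loadu_ps(src +  0)); /* cvtps2dq */
    __m128i d1 = _mm_cvtps_epi32(_mm_loadu_ps(src +  4));
    __m128i d2 = _mm_cvtps_epi32(_mm_loadu_ps(src +  8));
    __m128i d3 = _mm_cvtps_epi32(_mm_loadu_ps(src + 12));
    __m128i w0 = _mm_packs_epi32(d0, d1);  /* packssdw: dword -> word, signed sat */
    __m128i w1 = _mm_packs_epi32(d2, d3);
    __m128i b  = _mm_packus_epi16(w0, w1); /* packuswb: word -> byte, unsigned sat */
    _mm_storeu_si128((__m128i *)dst, b);
}
```

Note that the unsigned saturation in packuswb clamps values outside [0,255] for free: negative words become 0 and words above 255 become 255.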

10 Jul 2013, 13:52 

jmcclane
from the code above...
Code:
....
xmm0 ; a0, b0, g0, r0 ; 4 floats
xmm1 ; a1, b1, g1, r1 ; 4 floats
xmm2 ; a2, b2, g2, r2 ; 4 floats
xmm3 ; a3, b3, g3, r3 ; 4 floats
.....
cvtps2dq xmm3, xmm3
cvtps2dq xmm2, xmm2
cvtps2dq xmm1, xmm1
cvtps2dq xmm0, xmm0
packssdw xmm2, xmm3
packssdw xmm0, xmm1
packuswb xmm0, xmm2 ; xmm0 = a3,b3,g3,r3 a2,b2,g2,r2 a1,b1,g1,r1 a0,b0,g0,r0 ; 16 bytes
I tried it and it works great... Thank you very much for your help tthsqe! Now I will try to transpose the colors (matrix)... from the code above to get...
Code:
....
movdqa xmm1, xmm0
punpcklwd xmm0, xmm7
cvtdq2ps xmm0, xmm0 ; a0, b0, g0, r0 ; 4 floats > xmm0 = a0; a1; a2; a3
punpckhwd xmm1, xmm7
cvtdq2ps xmm1, xmm1 ; a1, b1, g1, r1 ; 4 floats > xmm1 = b0; b1; b2; b3
movdqa xmm3, xmm2
punpcklwd xmm2, xmm7
cvtdq2ps xmm2, xmm2 ; a2, b2, g2, r2 ; 4 floats > xmm2 = g0; g1; g2; g3
punpckhwd xmm3, xmm7
cvtdq2ps xmm3, xmm3 ; a3, b3, g3, r3 ; 4 floats > xmm3 = r0; r1; r2; r3
...so any help is welcome

13 Jul 2013, 13:53 

cod3b453
The following will load the 16 bytes from xmm0 into xmm0-xmm3 as transposed 32-bit floats:
Code:
	; xmm0 = A3 B3 G3 R3 A2 B2 G2 R2 A1 B1 G1 R1 A0 B0 G0 R0
	movdqa xmm1,dqword [mask_000000FF000000FF000000FF000000FF] ; xmm1 = 00 00 00 FF 00 00 00 FF 00 00 00 FF 00 00 00 FF
	movdqa xmm3,xmm0   ; xmm3 = A3 B3 G3 R3 A2 B2 G2 R2 A1 B1 G1 R1 A0 B0 G0 R0
	psrld xmm0,8       ; xmm0 = 00 A3 B3 G3 00 A2 B2 G2 00 A1 B1 G1 00 A0 B0 G0
	pand xmm3,xmm1     ; xmm3 = 00 00 00 R3 00 00 00 R2 00 00 00 R1 00 00 00 R0
	movdqa xmm2,xmm0   ; xmm2 = 00 A3 B3 G3 00 A2 B2 G2 00 A1 B1 G1 00 A0 B0 G0
	psrld xmm0,8       ; xmm0 = 00 00 A3 B3 00 00 A2 B2 00 00 A1 B1 00 00 A0 B0
	pand xmm2,xmm1     ; xmm2 = 00 00 00 G3 00 00 00 G2 00 00 00 G1 00 00 00 G0
	pand xmm1,xmm0     ; xmm1 = 00 00 00 B3 00 00 00 B2 00 00 00 B1 00 00 00 B0
	psrld xmm0,8       ; xmm0 = 00 00 00 A3 00 00 00 A2 00 00 00 A1 00 00 00 A0
	cvtdq2ps xmm3,xmm3 ; xmm3 = R3 R2 R1 R0
	cvtdq2ps xmm2,xmm2 ; xmm2 = G3 G2 G1 G0
	cvtdq2ps xmm1,xmm1 ; xmm1 = B3 B2 B1 B0
	cvtdq2ps xmm0,xmm0 ; xmm0 = A3 A2 A1 A0
	; ...

align 16
mask_000000FF000000FF000000FF000000FF:
	dd 0x000000FF,0x000000FF,0x000000FF,0x000000FF
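The same shift-and-mask transpose, sketched with C intrinsics for reference (the function name is mine; psrld = `_mm_srli_epi32`, pand = `_mm_and_si128`, cvtdq2ps = `_mm_cvtepi32_ps`):

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Transpose 4 packed RGBA pixels into four planar float vectors. */
static void rgba_to_planar_sse2(const uint8_t px[16],
                                float r[4], float g[4], float b[4], float a[4])
{
    __m128i v    = _mm_loadu_si128((const __m128i *)px); /* A3B3G3R3 ... A0B0G0R0 */
    __m128i mask = _mm_set1_epi32(0x000000FF);
    __m128i rd = _mm_and_si128(v, mask);                     /* pand: isolate R bytes */
    __m128i gd = _mm_and_si128(_mm_srli_epi32(v,  8), mask); /* psrld+pand: G bytes  */
    __m128i bd = _mm_and_si128(_mm_srli_epi32(v, 16), mask); /* psrld+pand: B bytes  */
    __m128i ad = _mm_srli_epi32(v, 24);                      /* A is already the top byte */
    _mm_storeu_ps(r, _mm_cvtepi32_ps(rd)); /* cvtdq2ps */
    _mm_storeu_ps(g, _mm_cvtepi32_ps(gd));
    _mm_storeu_ps(b, _mm_cvtepi32_ps(bd));
    _mm_storeu_ps(a, _mm_cvtepi32_ps(ad));
}
```

Since the alpha channel sits in the high byte of each dword, the final `psrld xmm0,24`-style shift needs no mask at all, which is why the asm above gets away with a single mask constant.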

14 Jul 2013, 10:35 

jmcclane
Thanks cod3b453!


15 Jul 2013, 08:27 


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.