flat assembler
Message board for the users of flat assembler.
cod3b453 09 Jul 2013, 16:34
If this is getting called a lot, it may help to move the loop into this procedure. Any instructions that might do better require more recent instruction sets, SSSE3+ (pshufb) or AVX (vcvtdq2ps); are these available? It may also be worth enforcing aligned buffers so you can use movdqa/movaps for loads/stores, and reordering the instructions to reduce interdependencies in the punpckxxx/cvtdq2ps stages.
jmcclane 09 Jul 2013, 16:45
Thanks... but I want to do it without SSSE3+.
tthsqe 09 Jul 2013, 18:01
I couldn't get anything faster using <= SSE2. It of course helps to have things aligned so that you can use aligned movs.
Considering that the optimal SSE4.1 solution
Code:
    irps i, 0 1 2 3 {
        pmovzxbd xmm#i,[eax+4*i]
        cvtdq2ps xmm#i,xmm#i
        movaps   [ecx+16*i],xmm#i
    }
is only about 20% faster on my machine, I would say that you have hit the throughput limit of the SSE instructions. If you need more speed and this function is the bottleneck, I would say get rid of the byte representation in your program and just stick with the 4x larger unpacked float representation.
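For readers who prefer intrinsics, here is a minimal C sketch of the SSE2-only byte -> float path discussed above (punpcklbw/punpckhbw, then punpcklwd/punpckhwd, then cvtdq2ps). The function name `bytes16_to_floats` is hypothetical, and unaligned loads/stores are an assumption made so the sketch works with any pointers; with 16-byte-aligned buffers the aligned variants could be used instead.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Convert 16 unsigned bytes at src into 16 floats at dst, SSE2 only.
   Hypothetical helper; not taken from the thread. */
void bytes16_to_floats(const uint8_t *src, float *dst)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i b = _mm_loadu_si128((const __m128i *)src);

    /* punpcklbw / punpckhbw with zero: bytes -> zero-extended words */
    __m128i w0 = _mm_unpacklo_epi8(b, zero);
    __m128i w1 = _mm_unpackhi_epi8(b, zero);

    /* punpcklwd / punpckhwd with zero: words -> zero-extended dwords */
    __m128i d0 = _mm_unpacklo_epi16(w0, zero);
    __m128i d1 = _mm_unpackhi_epi16(w0, zero);
    __m128i d2 = _mm_unpacklo_epi16(w1, zero);
    __m128i d3 = _mm_unpackhi_epi16(w1, zero);

    /* cvtdq2ps: signed dword -> float (values are 0..255, so sign is no issue) */
    _mm_storeu_ps(dst +  0, _mm_cvtepi32_ps(d0));
    _mm_storeu_ps(dst +  4, _mm_cvtepi32_ps(d1));
    _mm_storeu_ps(dst +  8, _mm_cvtepi32_ps(d2));
    _mm_storeu_ps(dst + 12, _mm_cvtepi32_ps(d3));
}
```

The SSE4.1 version replaces both unpack stages with a single pmovzxbd per four bytes, which is where the modest speedup mentioned above would come from.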
jmcclane 09 Jul 2013, 18:40
Thanks a lot... I have an old CPU that doesn't support SSSE3+ or SSE4, so I use SSE2 or SSE3; I can't test this :(
Does anyone know how to convert back to 16 bytes?
tthsqe 09 Jul 2013, 18:54
Your solution for 4x byte -> 4x float is fairly good if you stick to SSE2.
However, may I ask why your program needs a fast byte -> float function?
jmcclane 09 Jul 2013, 19:20
I'm learning... later I want to use it for slightly faster effects on images. I use VB.NET and fasm... it's a hobby that I love.
tthsqe 10 Jul 2013, 13:52
jmcclane,
to invert what you are doing, all you need to do is reverse the instruction order and invert each operation.
Code:
    cvtdq2ps            -> cvtps2dq
    punpcklwd+punpckhwd -> packssdw (or packusdw, SSE4.1)
    punpcklbw+punpckhbw -> packuswb (packsswb does not work!)
If you assume that your input floats round to an integer in the interval [0,255], you will need to use the unsigned pack for the word->byte step, but you don't have to worry about the sign for the dword->word conversion. So it all happily fits in SSE2. Also, as I said before, it is probably faster to work with floats in your image algorithms, so you should follow:
Code:
    1. convert bytes in .bmp to internal float representation
    2. do all internal computations with floats
    3. convert floats back to bytes in .bmp
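The inverse mapping above can be sketched with C intrinsics as follows. This is only an illustration under assumptions: `floats16_to_bytes` is a hypothetical name, unaligned loads/stores are used for generality, and the default round-to-nearest mode is assumed for cvtps2dq.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Convert 16 floats at src into 16 unsigned bytes at dst, SSE2 only.
   Hypothetical helper; out-of-range inputs are clamped to 0..255. */
void floats16_to_bytes(const float *src, uint8_t *dst)
{
    /* cvtps2dq: float -> signed dword, rounded per the current MXCSR mode */
    __m128i d0 = _mm_cvtps_epi32(_mm_loadu_ps(src +  0));
    __m128i d1 = _mm_cvtps_epi32(_mm_loadu_ps(src +  4));
    __m128i d2 = _mm_cvtps_epi32(_mm_loadu_ps(src +  8));
    __m128i d3 = _mm_cvtps_epi32(_mm_loadu_ps(src + 12));

    /* packssdw: dword -> word with signed saturation;
       values in 0..255 pass through unchanged */
    __m128i w0 = _mm_packs_epi32(d0, d1);
    __m128i w1 = _mm_packs_epi32(d2, d3);

    /* packuswb: word -> byte with unsigned saturation (clamps to 0..255).
       packsswb would clamp to -128..127 and turn 200 into 127. */
    __m128i b = _mm_packus_epi16(w0, w1);

    _mm_storeu_si128((__m128i *)dst, b);
}
```

Because both pack instructions take the low half of the result from the first operand, the output bytes stay in the same order as the input floats.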
jmcclane 13 Jul 2013, 13:53
From the code above...
Code:
    ; ....
    ; xmm0 = a0, b0, g0, r0 ; 4 floats
    ; xmm1 = a1, b1, g1, r1 ; 4 floats
    ; xmm2 = a2, b2, g2, r2 ; 4 floats
    ; xmm3 = a3, b3, g3, r3 ; 4 floats
    ; .....
    cvtps2dq xmm3, xmm3
    cvtps2dq xmm2, xmm2
    cvtps2dq xmm1, xmm1
    cvtps2dq xmm0, xmm0
    packssdw xmm2, xmm3
    packssdw xmm0, xmm1
    packuswb xmm0, xmm2 ; xmm0 = a3,b3,g3,r3 a2,b2,g2,r2 a1,b1,g1,r1 a0,b0,g0,r0 ; 16 bytes
I tried it and it works great... Thank you very much for your help, tthsqe!
Next I will try to transpose the colors (matrix)... starting from the code above, to get:
Code:
    ; ....
    movdqa    xmm1, xmm0
    punpcklwd xmm0, xmm7
    cvtdq2ps  xmm0, xmm0 ; a0, b0, g0, r0 ; 4 floats -> xmm0 = a0; a1; a2; a3
    punpckhwd xmm1, xmm7
    cvtdq2ps  xmm1, xmm1 ; a1, b1, g1, r1 ; 4 floats -> xmm1 = b0; b1; b2; b3
    movdqa    xmm3, xmm2
    punpcklwd xmm2, xmm7
    cvtdq2ps  xmm2, xmm2 ; a2, b2, g2, r2 ; 4 floats -> xmm2 = g0; g1; g2; g3
    punpckhwd xmm3, xmm7
    cvtdq2ps  xmm3, xmm3 ; a3, b3, g3, r3 ; 4 floats -> xmm3 = r0; r1; r2; r3
...so any help is welcome.
cod3b453 14 Jul 2013, 10:35
The following will load the 16 bytes from xmm0 into xmm0-3 as transposed 32-bit floats:
Code:
                                     ; xmm0 = A3 B3 G3 R3 A2 B2 G2 R2 A1 B1 G1 R1 A0 B0 G0 R0
    movdqa   xmm1,dqword [mask_000000FF000000FF000000FF000000FF]
                                     ; xmm1 = 00 00 00 FF 00 00 00 FF 00 00 00 FF 00 00 00 FF
    movdqa   xmm3,xmm0               ; xmm3 = A3 B3 G3 R3 A2 B2 G2 R2 A1 B1 G1 R1 A0 B0 G0 R0
    psrld    xmm0,8                  ; xmm0 = 00 A3 B3 G3 00 A2 B2 G2 00 A1 B1 G1 00 A0 B0 G0
    pand     xmm3,xmm1               ; xmm3 = 00 00 00 R3 00 00 00 R2 00 00 00 R1 00 00 00 R0
    movdqa   xmm2,xmm0               ; xmm2 = 00 A3 B3 G3 00 A2 B2 G2 00 A1 B1 G1 00 A0 B0 G0
    psrld    xmm0,8                  ; xmm0 = 00 00 A3 B3 00 00 A2 B2 00 00 A1 B1 00 00 A0 B0
    pand     xmm2,xmm1               ; xmm2 = 00 00 00 G3 00 00 00 G2 00 00 00 G1 00 00 00 G0
    pand     xmm1,xmm0               ; xmm1 = 00 00 00 B3 00 00 00 B2 00 00 00 B1 00 00 00 B0
    psrld    xmm0,8                  ; xmm0 = 00 00 00 A3 00 00 00 A2 00 00 00 A1 00 00 00 A0
    cvtdq2ps xmm3,xmm3               ; xmm3 = R3 R2 R1 R0
    cvtdq2ps xmm2,xmm2               ; xmm2 = G3 G2 G1 G0
    cvtdq2ps xmm1,xmm1               ; xmm1 = B3 B2 B1 B0
    cvtdq2ps xmm0,xmm0               ; xmm0 = A3 A2 A1 A0
    ; ...
    align 16
    mask_000000FF000000FF000000FF000000FF:
        dd 0x000000FF,0x000000FF,0x000000FF,0x000000FF
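The same shift-and-mask idea can be sketched in C intrinsics to check the lane order. The function name `pixels4_to_planar_floats` and the four separate output pointers are assumptions for illustration; the channel labels follow the post, with R in the lowest byte of each dword.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Split 4 packed pixels (16 bytes) into four float[4] arrays, one per
   channel. Hypothetical helper; byte 0 of each pixel is treated as R,
   byte 3 as A, matching the register diagram in the post. */
void pixels4_to_planar_floats(const uint8_t *src,
                              float *r, float *g, float *b, float *a)
{
    const __m128i mask = _mm_set1_epi32(0x000000FF);
    __m128i px = _mm_loadu_si128((const __m128i *)src);

    /* pand: keep the low byte of each dword */
    __m128i r32 = _mm_and_si128(px, mask);
    /* psrld + pand: move the wanted byte down, then isolate it */
    __m128i g32 = _mm_and_si128(_mm_srli_epi32(px, 8), mask);
    __m128i b32 = _mm_and_si128(_mm_srli_epi32(px, 16), mask);
    /* psrld by 24 already leaves only the top byte, so no mask is needed */
    __m128i a32 = _mm_srli_epi32(px, 24);

    /* cvtdq2ps: dword -> float */
    _mm_storeu_ps(r, _mm_cvtepi32_ps(r32));
    _mm_storeu_ps(g, _mm_cvtepi32_ps(g32));
    _mm_storeu_ps(b, _mm_cvtepi32_ps(b32));
    _mm_storeu_ps(a, _mm_cvtepi32_ps(a32));
}
```

The sketch shifts the original value by 8, 16 and 24 instead of shifting one register three times by 8; the results are the same, and the independent shifts may shorten the dependency chain.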
jmcclane 15 Jul 2013, 08:27
Thanks, cod3b453!
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.