flat assembler
Message board for the users of flat assembler.
MMX/SSE newbie question
r22 21 Feb 2006, 06:57
Formatting the arrays with padding and getting your sizes right is key.
It will cost a bit of memory, but extending your char values to 32-bit integers, and making sure your integers really are 32 bits, will make your task much easier. rgbP (P = padding char) only takes up 4 bytes (32 bits), but moving that into an SSE register, then interleaving the bytes into words and the words into dwords, is a lot of overhead. If you don't have to do too many additions, you can use WORD-size (16-bit) int arrays instead.
Here's the SSE code for the following:
array1 ;;char array, format rgb0rgb0rgb0...
array2 ;;int16 array, format 1230123012301230...
Each number represents 16 bits for the int16; each letter represents 8 bits for the char.
I'll assume you want 32-bit code (as opposed to 64-bit).
Make sure there are 16 bytes of zero padding at the end of each array, because these unaligned SSE reads will take more data than we will use.
Code:
    mov eax, array1                  ;; move address of array1 into eax
    mov edx, array2                  ;; move address of array2 into edx
    mov ecx, LINE_LENGTH_IN_DWORDS   ;; e.g. line rgb0rgb0rgb0 = 3
    pxor xmm0, xmm0                  ;; xmm0 = 0 (running sum)
    pxor xmm2, xmm2                  ;; xmm2 = 0
.LabelLoop:
    dec ecx                          ;; ecx--
    js .LabelEnd                     ;; if ecx < 0, end the loop
    movq xmm1, qword [eax + ecx*4]   ;; move an rgb0rgb0 into xmm1
    ;; the 2nd rgb0 will not be used; if you knew that the lines
    ;; would have an even-numbered length YOU COULD USE BOTH
    punpcklbw xmm1, xmm2             ;; changes unsigned bytes to unsigned words
    paddw xmm0, xmm1                 ;; add the words
    movq qword [edx + ecx*8], xmm0   ;; save to 16-bit word array
    jmp .LabelLoop
.LabelEnd:
Here's the version that does two rgb0 at the same time. Use this version for better speed if you know beforehand that the line size will be an even number.
Code:
    mov eax, array1                    ;; move address of array1 into eax
    mov edx, array2                    ;; move address of array2 into edx
    mov ecx, LINE_LENGTH_IN_DWORDS     ;; e.g. line rgb0rgb0rgb0rgb0 = 4
    pxor xmm0, xmm0                    ;; xmm0 = 0 (running sums)
    pxor xmm2, xmm2                    ;; xmm2 = 0
.LabelLoop:
    sub ecx, 2                         ;; step back two pixels
    js .LabelEnd                       ;; if ecx < 0, end the loop
    movdqu xmm1, dqword [eax + ecx*4]  ;; load 16 bytes; only the low rgb0rgb0 is used
    punpcklbw xmm1, xmm2               ;; changes unsigned bytes to unsigned words
    paddw xmm0, xmm1                   ;; add the words: the low 4 words total the even pixels, the high 4 the odd pixels
    movdqu dqword [edx + ecx*8], xmm0  ;; save both pixels to the 16-bit word array
    jmp .LabelLoop
.LabelEnd:
They are very similar, but the 2nd version will run faster and allows for some optimizations down the line. If you can't be sure of the line size, use the first example and change all the xmm? to mm?, because there's no point in using 128-bit SSE registers when you only need 64-bit MMX registers.
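For checking the assembly against known values, a plain C model of what the first loop computes may help. This is only a sketch: the name `suffix_sum_rgb0` and its parameters are made up for illustration and do not come from the post. Like the loop, it walks from the last pixel to the first, so each output pixel holds the sum of itself and everything to its right, and the sums wrap at 16 bits the way paddw does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of the first loop (not the poster's code).
   src: n_pixels pixels stored as r,g,b,0 bytes.
   dst: 4 uint16_t per pixel, same channel order.
   Walks right to left, so dst pixel i holds the sums of pixels i..n_pixels-1.
   Sums wrap modulo 65536, matching paddw. */
void suffix_sum_rgb0(const uint8_t *src, uint16_t *dst, size_t n_pixels)
{
    uint16_t acc[4] = {0, 0, 0, 0};   /* plays the role of the low half of xmm0 */

    for (size_t i = n_pixels; i-- > 0; ) {
        for (int c = 0; c < 4; c++) {
            /* widen the byte to a word and accumulate (punpcklbw + paddw) */
            acc[c] = (uint16_t)(acc[c] + src[i * 4 + c]);
            /* store the running total for this pixel (movq store) */
            dst[i * 4 + c] = acc[c];
        }
    }
}
```

Feeding the same small array to this function and to the assembly loop and comparing the outputs is a cheap way to catch a wrong scale factor or a wrong store width.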
dannemanare 21 Feb 2006, 08:43
Thank you! Looks great. I'll start messing around with it and if I get stuck I'll post again.
I'd better stick with 32-bit depth in the integral image, though. int16 would overflow pretty quickly.
Madis731 21 Feb 2006, 09:40
It depends on the conditions under which the overflow occurs. You said you are integrating only four values: at pixel (x,y) you calculate your value from (x+1,y), (x-1,y), (x,y+1) and (x,y-1), so only four chars, and their maximum sum is 255*4 = 1020, whereas a word can hold 65535. A word has room for at least 257 byte additions (257*255 = 65535).
That said, 32-bit is a good choice because an average CPU today reads and writes 32-bit values the fastest, while 16-bit reads/writes handle 32 bits together but discard the other 16. Of course MMX/XMM are exceptions... Also, the code above only demonstrates adding along horizontal (or vertical) lines; if you want to calculate on a grid, you should give the XMM register all four of these coordinates. Don't worry, it can hold all of them together in one register.
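The overflow arithmetic in this exchange can be checked with a few lines of C. This is a minimal sketch, and `safe_additions` is a hypothetical helper name rather than anything from the thread: it returns how many terms of a given maximum size are guaranteed to fit in an unsigned accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of additions of a term no larger than max_term
   that are guaranteed not to overflow an unsigned accumulator whose largest
   representable value is acc_max. */
uint64_t safe_additions(uint64_t acc_max, uint64_t max_term)
{
    assert(max_term > 0);          /* a zero-sized term would divide by zero */
    return acc_max / max_term;
}
```

With these numbers, a 16-bit word is safe for 257 full-scale bytes (65535 / 255) or 64 four-pixel sums of 1020 each, while a 32-bit dword is safe for 16843009 full-scale bytes, which is far more than the 307200 pixels of a 640x480 image.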
dannemanare 21 Feb 2006, 15:08
Well, maybe I wasn't very clear about what I call the integral image. Or maybe I'm just missing your point.
A pixel in the integral image is the sum of all pixels above and to the left of that location in the original image. If you have a VGA image, all white, and sum up the top scanline, the pixel at [0,639] will be (163200, ...). The bottom-right pixel will have rgb values of 640*480*255 ~ 78e6. Kind of like a summed-area table (?)
Right now I read 16 chars at a time from the input image into an xmm register. Then I unpack those four times, using punpckXbw and punpckXwd with different combinations of low and high, to separate each quadruple of rgba values into 4x dword, and add using paddd. (You guys could probably do that a lot more elegantly and efficiently.)
I guess the main problem is all the memory addressing I have to do: keeping a trailing pointer to the previous scanline and such. There's really no way around that problem. Letting the compiler do all the optimization, I might end up with hardly any improvement at all. I enjoy learning some assembler and SSE, though.
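The scheme described here can be sketched with SSE2 intrinsics in C, which map one-to-one onto the instructions named above. Treat this as an illustration under assumptions, not as the poster's code: the function name `integral_rgba`, the rgba8 input layout and the four-dwords-per-pixel output layout are all assumed, and it handles one pixel per iteration for clarity instead of 16 bytes at a time.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of an integral image with 32-bit depth.
   src: width*height pixels, 4 bytes each (r,g,b,a).
   dst: width*height pixels, 4 uint32_t each, same channel order.
   dst(x,y) = sum of src(x',y') over all x' <= x and y' <= y. */
void integral_rgba(const uint8_t *src, uint32_t *dst, size_t width, size_t height)
{
    assert(src != NULL && dst != NULL);
    const __m128i zero = _mm_setzero_si128();

    for (size_t y = 0; y < height; y++) {
        const uint8_t *in = src + y * width * 4;
        uint32_t *out = dst + y * width * 4;
        /* trailing pointer to the previous output scanline, none for row 0 */
        const uint32_t *above = (y > 0) ? out - width * 4 : NULL;
        __m128i row_sum = zero;                  /* running sum along this scanline */

        for (size_t x = 0; x < width; x++) {
            int32_t packed;
            memcpy(&packed, in + x * 4, sizeof packed);    /* one rgba pixel */
            __m128i px = _mm_cvtsi32_si128(packed);        /* movd */
            px = _mm_unpacklo_epi8(px, zero);              /* punpcklbw: bytes -> words */
            px = _mm_unpacklo_epi16(px, zero);             /* punpcklwd: words -> dwords */
            row_sum = _mm_add_epi32(row_sum, px);          /* paddd */

            __m128i total = row_sum;
            if (above != NULL) {
                /* add the integral value directly above this pixel */
                total = _mm_add_epi32(
                    total, _mm_loadu_si128((const __m128i *)(above + x * 4)));
            }
            _mm_storeu_si128((__m128i *)(out + x * 4), total);
        }
    }
}
```

On an all-white 640x480 input this reproduces the figures quoted above: 163200 at the end of the top scanline and 78336000 at the bottom-right pixel. Keeping the running scanline sum in a register means the only extra memory read per pixel is the value from the row above.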
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.