flat assembler
Message board for the users of flat assembler.

MMX/SSE newbie question

dannemanare



Hi!
I'm new at this so please bear with me... Have been browsing for a good forum for some time now, thought I'd give this one a shot.

I'm building some computer vision algorithms, and as part of these I construct the "integral image": starting from an original image, take the cumulative sums in both directions, so that each pixel in the integral image is the sum of all pixels above and to the left of (and including) the corresponding pixel in the original image.
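For reference, here is a minimal scalar sketch of that definition in C. It assumes an interleaved image with `channels` bytes per pixel and a 32-bit output array of the same shape; the function name `integral_image_ref` and its parameters are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference integral image for an interleaved image (assumed layout).
   src: height * width * channels bytes
   dst: height * width * channels uint32_t
   dst at (x, y, c) = sum of src at (u, v, c) for all u <= x, v <= y. */
void integral_image_ref(const uint8_t *src, uint32_t *dst,
                        size_t width, size_t height, size_t channels)
{
    for (size_t y = 0; y < height; y++) {
        for (size_t c = 0; c < channels; c++) {
            uint32_t row_sum = 0;  /* running sum along the current row */
            for (size_t x = 0; x < width; x++) {
                size_t i = (y * width + x) * channels + c;
                row_sum += src[i];
                /* integral value directly above, or 0 on the first row */
                uint32_t above = (y > 0) ? dst[i - width * channels] : 0;
                dst[i] = row_sum + above;
            }
        }
    }
}
```

A vectorized version has to produce the same numbers, so a routine like this is handy for checking results.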

I thought this would be a nice application for vector operations and started looking for info on mmx/sse etc. Turned out it was a bit more tricky than I'd hoped. I'll get to the point:

I have one char array for the original image (rgbrgbrgb...) and a similar int array for the integral image. I want to read the 3 rgb char values of each pixel (and maybe a fourth 0 alpha padding element) into an sse register and add them to the 3 (4) corresponding integral image int pixel values. Processing the first line of the image could be something like this:

unsigned char im[];
unsigned int int_im[];

xmm0 = 0;
for (j=0; j<line_length; j+=4)
{
    xmm1 = im[j:j+4];
    xmm0 += xmm1;
    int_im[j:j+4] = xmm0;
}


I've found some simple examples adding e.g. char arrays together, but none where you read chars into int slots, so to speak.

If someone could help me out, I'd really appreciate it!


Cheers,

Dan
Post 20 Feb 2006, 11:15
r22



Formatting the arrays with padding and getting your sizes right is key.
It will cost a bit in memory, but extending your char values to 32-bit integers, and making sure your integers really are 32 bits, will make your task much easier.

rgbP (P = padding char) only takes up 4 bytes (32 bits); moving that into an SSE register, then interleaving the bytes into words and then the words into dwords, is a lot of overhead.

If you don't have to do too many additions, then you can use WORD-size (16-bit) int arrays.

Here's the SSE code for the following:
array1 ;;char array, format rgb0rgb0rgb0... where each letter is one 8-bit char
array2 ;;int16 array, format 123012301230... where each digit is one 16-bit int (1, 2, 3 = sums for r, g, b; 0 = padding)

I'll assume you want 32bit code (as opposed to 64bit)

Make sure there's 16 bytes of 0 padding at the end of each array
because these unaligned SSE reads will take more data than we will use
Code:
mov eax, array1 ;;move address of array1 into eax
mov edx, array2 ;;move address of array2 into edx
xor ecx, ecx ;;ecx = pixel index, start at the left so the sum runs left to right
pxor xmm0,xmm0 ;;xmm0 = 0 (running sum)
pxor xmm2,xmm2 ;;xmm2 = 0 (used for zero extension)
.LabelLoop:
cmp ecx, LINE_LENGTH_IN_DWORDS ;; ie line rgb0rgb0rgb0 = 3
jae .LabelEnd ;; if ecx >= line length end the loop
movq xmm1, qword[eax + ecx*4] ;;move an rgb0rgb0 into xmm1
;; the 2nd rgb0 will not be used if you knew that the lines
;; would have an even numbered length YOU COULD USE BOTH
punpcklbw xmm1,xmm2 ;;changes unsigned bytes to unsigned words
paddw xmm0,xmm1 ;;add the words
movq qword[edx + ecx*8], xmm0 ;;save the low 4 words (one pixel) to 16bit word array
inc ecx ;; ecx++
jmp .LabelLoop
.LabelEnd:
    


Here's the version that does two rgb0 at the same time.
Use this version for better speed if you know beforehand that
the line size will be an even number.

Code:
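mov eax, array1 ;;move address of array1 into eax
mov edx, array2 ;;move address of array2 into edx
xor ecx, ecx ;;ecx = pixel index, start at the left
pxor xmm0,xmm0 ;;xmm0 = 0 (running sum, kept in both halves)
pxor xmm2,xmm2 ;;xmm2 = 0
.LabelLoop:
cmp ecx, LINE_LENGTH_IN_DWORDS ;; ie line rgb0rgb0rgb0rgb0 = 4
jae .LabelEnd ;; if ecx >= line length end the loop
movq xmm1, qword[eax + ecx*4] ;;move an rgb0rgb0 (2 pixels) into xmm1
punpcklbw xmm1,xmm2 ;;changes unsigned bytes to unsigned words
movdqa xmm3,xmm1 ;;copy the two pixels
pslldq xmm3,8 ;;xmm3: low half = 0, high half = 1st pixel
paddw xmm1,xmm3 ;;xmm1: low half = 1st pixel, high half = 1st + 2nd pixel
paddw xmm1,xmm0 ;;add the running sum to both halves
movdqu dqword[edx + ecx*8], xmm1 ;;save 8 words (2 pixels) to 16bit word array
pshufd xmm0,xmm1,0EEh ;;new running sum = high half, copied into both halves
add ecx,2
jmp .LabelLoop
.LabelEnd: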
    


They are very similar, but the 2nd version will run faster and allows for some optimizations down the line. If you can't be sure of the line size, use the first example and change all the xmm? to mm?, because there's no point in using 128-bit SSE registers when you only need 64-bit MMX registers.
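For anyone who prefers compiler intrinsics, the one-pixel-per-step idea can be sketched in C roughly as follows. This is only a sketch under the same assumptions (rgb0 input, 16-bit running sums that wrap on overflow); the function name `row_prefix_sum_u16` is made up for the example, and it loads exactly 4 bytes per pixel so no end padding is needed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Running sum along one line of rgb0 pixels into 16-bit accumulators.
   src: n_pixels * 4 bytes, dst: n_pixels * 4 uint16_t. */
void row_prefix_sum_u16(const uint8_t *src, uint16_t *dst, size_t n_pixels)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();             /* running sum, low 4 words used */
    for (size_t i = 0; i < n_pixels; i++) {
        int32_t px;
        memcpy(&px, src + 4 * i, sizeof px);       /* load exactly one rgb0 pixel */
        __m128i v = _mm_cvtsi32_si128(px);         /* movd */
        v = _mm_unpacklo_epi8(v, zero);            /* punpcklbw: bytes -> words */
        acc = _mm_add_epi16(acc, v);               /* paddw, wraps past 65535 */
        _mm_storel_epi64((__m128i *)(dst + 4 * i), acc);  /* movq: store 4 words */
    }
}
```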
Post 21 Feb 2006, 06:57
dannemanare



Thank you! Looks great. I'll start messing around with it and if I get stuck I'll post again.

I'd better stick with 32-bit depth in the integral image, though. int16 would overflow pretty quickly.
Post 21 Feb 2006, 08:43
Madis731



It depends on the conditions under which the overflow occurs. You said you are integrating only four values: at pixel (x,y) you calculate your value from (x+1,y), (x-1,y), (x,y+1), (x,y-1), so only four chars, and their maximum sum is 255*4 = 1020, while a word can hold 65535. A word can take up to 257 additions of the maximum byte value 255 (257*255 = 65535) before it overflows.
That said, 32-bit is a good choice because an average CPU today reads and writes 32-bit values the fastest; a 16-bit read/write still handles 32 bits at once but discards the other 16. Of course MMX/XMM are exceptions...

And the code above is only a demonstration of adding along horizontal (or vertical) lines; if you want to calculate on a grid, you should give the XMM register all four of these coordinates. Don't worry, it can hold all of them together in one register :D
Post 21 Feb 2006, 09:40
dannemanare



Well, maybe I wasn't very clear about what I call the integral image. Or maybe I'm just missing your point :)

A pixel in the integral image is the sum of all pixels above and to the left of that location in the original image. If you have a VGA image (640x480), all white, and sum up the top scanline, the pixel at [0,639] will be (163200, ...). The bottom right pixel will have rgb values of 640*480*255 ~ 78e6. Kind of like a summed-area table (?)
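The arithmetic behind those numbers can be checked with a couple of tiny helpers. This is just a sketch; the names `max_integral_value` and `max_adds_before_overflow` are made up for illustration.

```c
#include <stdint.h>

/* Largest value an integral image can reach: every pixel at max_value. */
uint64_t max_integral_value(uint64_t width, uint64_t height, uint64_t max_value)
{
    return width * height * max_value;
}

/* How many worst-case pixels (each equal to max_value) an accumulator
   with the given upper limit can absorb without overflowing. */
uint64_t max_adds_before_overflow(uint64_t limit, uint64_t max_value)
{
    return limit / max_value;
}
```

For example, `max_integral_value(640, 1, 255)` is 163200 and `max_integral_value(640, 480, 255)` is 78336000, which fits easily in 32 bits, while `max_adds_before_overflow(UINT16_MAX, 255)` is 257, so a 16-bit accumulator is only safe for runs of up to 257 all-white pixels.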

Right now I read 16 chars at a time from the input image into an xmm register. Then I unpack those four times, using punpckXbw and punpckXwd with different combinations of low and high, to filter out each quadruple of rgba values into a 4x dword, and add using paddd. (You guys could probably do that a lot more elegantly and efficiently.)
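Written with C intrinsics instead of raw assembler, that unpack-and-add step looks roughly like the sketch below. It assumes rgba input with 4 bytes per pixel, a line length that is a multiple of 4 pixels, and 32-bit running sums; `row_prefix_sum_u32` is a made-up name, not the actual code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Running sum along one line of rgba pixels into 32-bit accumulators,
   four pixels (16 bytes) per iteration. n_pixels must be a multiple of 4.
   src: n_pixels * 4 bytes, dst: n_pixels * 4 uint32_t. */
void row_prefix_sum_u32(const uint8_t *src, uint32_t *dst, size_t n_pixels)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();                 /* running rgba sum */
    for (size_t i = 0; i < n_pixels; i += 4) {
        __m128i bytes = _mm_loadu_si128((const __m128i *)(src + 4 * i));
        __m128i lo = _mm_unpacklo_epi8(bytes, zero);   /* punpcklbw: pixels 0,1 as words */
        __m128i hi = _mm_unpackhi_epi8(bytes, zero);   /* punpckhbw: pixels 2,3 as words */
        __m128i p[4];
        p[0] = _mm_unpacklo_epi16(lo, zero);           /* punpcklwd: pixel 0 as dwords */
        p[1] = _mm_unpackhi_epi16(lo, zero);           /* punpckhwd: pixel 1 */
        p[2] = _mm_unpacklo_epi16(hi, zero);           /* pixel 2 */
        p[3] = _mm_unpackhi_epi16(hi, zero);           /* pixel 3 */
        for (int k = 0; k < 4; k++) {
            acc = _mm_add_epi32(acc, p[k]);            /* paddd */
            _mm_storeu_si128((__m128i *)(dst + 4 * (i + k)), acc);
        }
    }
}
```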

I guess the main problem is all the memory addressing I have to do, keeping a trailing pointer to the previous scanline and such. There's really no way around that problem. Letting the compiler do all the optimization, I might end up with hardly any improvement at all. I enjoy learning some assembler and SSE, though :)
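To make the trailing-pointer bookkeeping concrete, here is a rough C sketch with intrinsics: keep a running sum along the current scanline and add the already finished integral value from the scanline above. It assumes rgba input and 32-bit output and handles one pixel per step for clarity; `integral_image_rgba` is a made-up name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Integral image of an rgba image.
   src: height * width * 4 bytes, dst: height * width * 4 uint32_t. */
void integral_image_rgba(const uint8_t *src, uint32_t *dst,
                         size_t width, size_t height)
{
    const __m128i zero = _mm_setzero_si128();
    const uint32_t *prev = NULL;            /* trailing pointer: previous output row */
    for (size_t y = 0; y < height; y++) {
        const uint8_t *in = src + y * width * 4;
        uint32_t *out = dst + y * width * 4;
        __m128i row_sum = _mm_setzero_si128();   /* running sum along this row */
        for (size_t x = 0; x < width; x++) {
            int32_t px;
            memcpy(&px, in + 4 * x, sizeof px);  /* one rgba pixel */
            __m128i v = _mm_cvtsi32_si128(px);
            v = _mm_unpacklo_epi8(v, zero);      /* bytes -> words */
            v = _mm_unpacklo_epi16(v, zero);     /* words -> dwords */
            row_sum = _mm_add_epi32(row_sum, v);
            __m128i total = row_sum;
            if (prev != NULL)                    /* add the value directly above */
                total = _mm_add_epi32(total,
                        _mm_loadu_si128((const __m128i *)(prev + 4 * x)));
            _mm_storeu_si128((__m128i *)(out + 4 * x), total);
        }
        prev = out;
    }
}
```

The only extra addressing per pixel is the one load from the previous output row.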
Post 21 Feb 2006, 15:08

Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.
