flat assembler
Message board for the users of flat assembler.
 flat assembler > Projects and Ideas > Efficient float-based SSE code 4 sum of squared differences
Author
Justus73

Joined: 28 Mar 2009
Posts: 9

# Efficient float-based SSE code 4 sum of squared differences

Hi board,

as a brand new member of this board I would like to first thank you all. Before having registered (today) I read your posts, ideas and projects for some years now, and I enjoyed it a lot.

Currently, I rewrite an older (c++ with inline assembler) project of mine which makes excessively use of a simple matrix error calculation (2 dimensions), which is nothing but e = sum(a[x,y]-b[i,j])² for a rectangular area in a and b. That calculation is the "key time consumer" in the whole code. As I didn't code (SSE) assembly for a longer time now, I think my implementation is far from being optimal or efficient. So, I would kindly like to request your suggestions for optimizations.

Short C-code implementation what I am generally doing
 Code: float getErrorC(     int channelIndex,      int srcX,      int srcY,      int txtX,      int txtY,      int width,      int height,      float maxError) {     // channel elements are floats     IplImage *pSrcChannel = this->pSource->getChannel(channelIndex);     IplImage *pTxtChannel = this->pTexture->getChannel(channelIndex);     float error = 0.0;     int x, y;     float *pSrcData;     float *pTxtData;     float dif;     for (y=0; yimageData +            ((srcY + y) * pSrcChannel->widthStep)) +            (srcX * sizeof(float));       pTxtData = (float *) (pTxtChannel->imageData +            ((txtY + y) * pTxtChannel->widthStep)) +            (txtX * sizeof(float));              for (x=0; x maxError) {         return -1.0f;       }     }     return error; }

And this is my old SSE-implementation I want re-write in FASM, pls excuse me to show it as inline code... (with focus on the __asm block)

On my Q6600 the performance dramatically drops if I create more specialized code (e.g. a block for each remaining elements case from one to three elements). I suppose the code block then doesn't fit into instruction cache anylonger. So what I need would be a good compromise between code size and performance (that's why I only add up 8 elements in main loop as I thought that's a good idea).

What do you think?
28 Mar 2009, 10:02

Joined: 25 Sep 2003
Posts: 2146
Location: Estonia
1) replace movups with movaps (ICC will align pointers to m128 automatically, but I don't know how other compilers behave... always use caution)
2) I could really use some sample data and the size of it to make the optimizations. Is it in MB?
3) I would really like the code to be clearer to optimize more I don't know what it does, what its goals are...

 Code: movaps xmm0, [esi + 00]     // read src[0..3]     movaps xmm2, [esi + 16]     // read src[4..7]     subps xmm0, [edi + 00]      // you don't need to cache the destination     subps xmm2, [edi + 16]      // just subtract          add esi, 8*4                // make assembly resemble the C-code     add edi, 8*4                // src += 8 * sizeof(float), which is 4 [---]     // we need to process 4 pels remainers:     movaps xmm0, [esi + 00]     // read src[0..3]     subps xmm0, [edi + 00]      // read txt[0..3] //;There are more movUps down the code...

This should make your code upto twice as fast because even if data is aligned and you used unaligned loads on older CPUs (Q6600), then you hit a penalty.

Its no excuse to use movups even when on a Core i7
30 Mar 2009, 13:44
bitRAKE

Joined: 21 Jul 2003
Posts: 2624
Location: dank orb
The code cache is 32k - this code is less than 1k.

What is the likelihood of MaxError being exceeded?
30 Mar 2009, 17:13
Justus73

Joined: 28 Mar 2009
Posts: 9
Okay, let me explain further:

The code is part of a texture transfer tool. It is based upon Efros & Freeman's SIGGRAPH 2001 paper "Image Quilting for Texture Synthesis and Transfer" (from University of California Berkeley and Mitsubishi Electric Research Laboratories).

In short words explained:
Imagine we have two images, a "source image" and a "texture image". Furthermore, imagine we choose a small "transfer" area, let's say of 32x32 pixels.

The general idea behind the "texture transfer" is, to now move the "transfer" area over the source image in some steps (normally from top left to bottom right), always looking for the "best matching area" from "texture image" in order to "look like" the transfer area in the source image. The "thing that makes it work" is, that the transfer-areas of two following iterations "overlap" each other, for example by 1/3rd. In order to "not only create a similar-looking puzzle" from "source image" using puzzle-tiles from "texture image", the overlapping areas will be "painted" in a special way. Let's think about the vertical overlapping area of two "transfer areas", which would be of size 10x32 pixels (for example): we then look for that "path from y=0 to y=31" where the error between the texture and the current state is at it's minimum. Think about it, as if each "step" downwards (which can be either downleft-, down-, downright-wards) has a "cost" which is equal to (source[x, y] - texture[i, j])². The "right path" is then the path where the sum of those "costs of the steps" is minimum.

That path then is our "cut line to cut along", so we don't copy a rectangular block but along those paths in order to minimize optical errors between current image state and texture. This is done for the left and top edge of each block. Further, the image can be "transferred" in multiple iterations, reducing the transfer area's size from iteration to iteration. There is also a mechanism which allows a better preservation of the texture characteristics which have "yet been transported" (see the alpha formula on page 5 of the PDF), but that's not that relevant for the calculation.

That is, in short words, what my tool does. I hope, I explained it more or less understandable, and I hope the PDF helps to imagine what's all about - otherwise - please ask.

The "key CPU-time consumer" is to calculate the error between a source image block (NxM pixels large) and a texture image block (also NxM pixels large). There are many possibilities to "choose" a "block" from the texture image, for example by simply randomizing some (valid) X/Y positions in the texture image, calculate the errors between the source area and all (randomized) texture areas and use the one with the smallest error. Or one can iterate over all possible areas in the texture image - whatever one wants.

The code I posted is my "calculate the error between the luminosity channel of the source image (starting at position sx,sy) and the texture's luminosity channel of the texture image (starting at tx, ty).

"Normally", the transfer area has a size between 4 and 128 pixels, depending on the source image to be transferred, resulting in 4x4 up to 128x128 floating point error calculations "per try to locate a good match in the texture". In order to give you some numbers: you can assume you find an "acceptable looking" match with at least(!) 1.000 tries per transfer area, a "good looking" can be found with at least 10.000 tries per transfer area, assuming a transfer area of 32x32 pixels size and a texture of 512x384 pixels. As you already assume, the larger the texture the more tries one should start to find a good match "by random". To stay in the example with the 512x384 pixels wide "texture image" and 32x32 pixels transfer area: finding the best match would require (512-32+1)x(384-32+1) tries, means 169.793 tries. But, in fact, one gets already good result checking a 1/10th of the possible blocks.

The idea of using movaps instead of movups would require me to store (as well the texture as well the source image) four times, one time with element 0,4,8,12... at a 16-byte boundary, then with element 1, 5, 9, 13... at a 16-byte boundary and so on. Assuming source images up to 4096x3072 pixels, the memory requirement for a single instance is already

 Code: (4096 x 3072 pixels) x (4 channels Lum, R, G, B) x (4 bytes per float) = 201.326.592 = 192 MiBs.

I did not yet think about holding image channel data four times, but I think maybe it's worth a try, "twice as fast" sounds promising =)
Sample data is possible and no problem, but, in fact, you can use any two grayscale-image's pixelvalues and convert them to interval [0.0f, 1.0f] applying 1/255th as factor. If you want to, I can create some source code fragment with a whole bunch of "df's" to be includable via "file" directive, then forming the texture and source image in constant data definitions...

To make the stuff a bit more clear: think of the two images of size
 Code: textureWidth x textureHeight
and
 Code: sourceWidth x sourceHeight
as arrays of floats. The code does nothing more but

 Code: error = 0 // let (sx, sy) specify the top left point in source image // let (tx, ty) specify the top left point in texture image // whereas width and height of both areas is equal for (y = 0 upto transferAreaHeight-1) {   for (x = 0 upto transferAreaWidth-1) {     luminosityDifference = sourceImage[sx+x, sy+y] - textureImage[tx+x, ty+y]     error = error + (luminosityDifference * luminosityDifference)   }   if (error >= maxError) {     // cancel, cannot be a new "best match"     return -1   } } // looks like a new best match candidate: return error

The code says "looks like a new best match candidate" in the last row of pseudo-code, because due to the "read only" access to the images, the code can be executed multithreaded, allowing N threads look up "their" best match... choosing the best one of the N results in the last step. The advantage of "synchronizing the best match's error values" over the threads is very low, so it's not worth "sharing" a new best match's error over all threads - each one has it's own private "best match error value = maxError", so to say.

@bitRAKE
As explained in the pseudo-code above: the code is being executed in order to "check" whether that area is a "new best match" candidate. That means, if we already know the error of the "current best match" we can cancel further calculations as soon as we exceed that border, because the candidate cannot be a new best match anylonger. That's the reason why to check for not yet exceeded maxError from time to time.

Little statistic about the "maxError-cancellation" mechanism: 85% of all error calculations are cancelled before finishing 75% of the calculation. That, of course, depends on "when hitting the best candidate". The sooner one hits the best candidate, the more calculations are canceled "sooner".

32k instruction cache? Per core as I see? Fuck, I have definitly missed some CPU-generations... and I thought I would have to split up the loop into smaller pieces in any case... but 32k should be enough to hold, let's say 5 specialized calculation loops, with 8 elements per loop executed "width / 8"-times, and then "the remaining elements" with a 4 elements per loop code, 3 elements per loop, two elements per loop and a single element per loop, so I cover any possible permutation of "width MOD 8".

At the moment, I try to understand the data prefetching instructions, due to knowing "what scanline-part" will be calculated next I hope to save some further cycles, but I did not yet check what gain I can expect from that.

Please don't be impatient if my reaction time is poor, my wife just told me we got to babysit her brother's 4 little daughters from thursday to sunday (NOOOooooo.....)
Requiring some preparations, leaving me little time for my fun projects...

Some final words: tonite, when I could not get some sleep due to too much coffee because being too tired because of too few sleep (I know... I know...) I thought about the idea "slicing" up the area into columns, having a "top to bottom" calculation... for example, calculating 15x15 pixels in the following way:

1) calculate 8 elements/loop from top to bottom
2) increment start addresses by 8 elements
3) calculate 4 elements/loop from top to bottom
4) increment start addresses by 4 elements
5) calculate 3 elements/loop from top to bottom

The advantage would be, that I can write multiple codes for different block widths, and chain them together so they calculate "stripewise"... the code-size of the loops would be small and branch hinting (does that really save time?) would be easy to guess... but don't know how much of the iteration costs (3 x "loop iteration costs" instead of 1 x) eat up the performance gain by better instruction distribution on different registers.
30 Mar 2009, 20:49
bitRAKE

Joined: 21 Jul 2003
Posts: 2624
Location: dank orb
Interesting paper.

Are you really doing this on HDR data? Maybe it's better to operate on the integer data - there is a sum of absolute difference instruction.
31 Mar 2009, 06:05

Joined: 25 Sep 2003
Posts: 2146
Location: Estonia
You can still use aligned loads when the offsets aren't aligned. You just need alignment code on both ends.

 Code: Quoting from Intel manual 248966.pdf (Optimization Reference) Throughput (cycles):                         Core i7         45nm Core2      65nm Core2 Alignment Scenario      06_1AH          06_17H          06_0FH 16B aligned &             1               2               2 Not-16B aligned, not cache split           1              ~2              ~2 Split \$-line boundary    ~4.5           ~20             ~20

What I propose is that when unaligned hits you a 20cycle penalty in L1 cache, then you have about 19 cycles to align your code before the loop starts.
By the time the 20cycles ends you might have already loaded 16 bytes in another way:
 Code: ;movups xmm0, [esi + 00]     // This would cost us about 20cycles penalty   mov     eax,esi   and     esi,not 0Fh   movaps  xmm0,[esi + 00]   sub     eax,esi   jz      .aligned   movaps  xmm1,[esi + 16]   cmp     eax,4   jne     @f   palignr xmm1,xmm0,4 @@:   cmp     eax,8   jne     @f   palignr xmm1,xmm0,8 @@:   cmp     eax,12   jne     @f   palignr xmm1,xmm0,12 @@:;                            //This alternative would cost roughly 10cycles   ;Here xmm1 has the first 16 bytes [---] .aligned:   ;xmm0 has the first 16 bytes [---]

_________________
My updated idol http://www.agner.org/optimize/
31 Mar 2009, 07:47
Justus73

Joined: 28 Mar 2009
Posts: 9
First of all thank you for you suggestions - your two information allow me to try and optimize a lot, I guess the combination of both your suggestions gonna rock... gonna try those as soon as I can...

 Quote: You can still use aligned loads when the offsets aren't aligned. You just need alignment code on both ends.

Damn - you're so right... and I even already used that - during my working time in the gaming industry I wrote an excessively high optimized "isometric tile"-painting routine which used specially implemented copy-instructions for the leading and trailing "misaligned" parts... one for each possible misaligned "source address" and "destination address" combination (scanline-wise copying using jump-tables to look corresponding copy-method up)... that was on Pentium II... god, I'm getting old... really forgot that elementar knowledge... óO

@bitRAKE
 Quote: there is a sum of absolute difference instruction

Thanks - completely missed that one - as far as I just saw, the operation works on 16- or even 32-bytes at a time. Hmmm... that would allow a lot of things... for example, using the SAD (sum of abs. distances) for "fast identifying candidates", and then using the preciser SSD (sum of squared distances) on those candiates only. I would'nt lose much if I switch back to byte data, except I need to ensure that error calculations on blocks whose theoretical maximum errors exceed 255x256 caused by (width x height x 255 > 255x256) are subdivided into smaller partwise calculations, but that's easy...
31 Mar 2009, 10:22
rugxulo

Joined: 09 Aug 2005
Posts: 2124
Location: Usono (aka, USA)
 Madis731 wrote: 1) replace movups with movaps (ICC will align pointers to m128 automatically, but I don't know how other compilers behave... always use caution) ... This should make your code up to twice as fast because even if data is aligned and you used unaligned loads on older CPUs (Q6600), then you hit a penalty.

Is he using GCC?

 gcc.info wrote: `aligned (ALIGNMENT)' This attribute specifies a minimum alignment for the variable or structure field, measured in bytes. For example, the declaration: int x __attribute__ ((aligned (16))) = 0; causes the compiler to allocate the global variable `x' on a 16-byte boundary. On a 68040, this could be used in conjunction with an `asm' expression to access the `move16' instruction which requires 16-byte aligned operands. You can also specify the alignment of structure fields. For example, to create a double-word aligned `int' pair, you could write: struct foo { int x[2] __attribute__ ((aligned (8))); }; This is an alternative to creating a union with a `double' member that forces the union to be double-word aligned.

 Madis731 wrote: Its no excuse to use movups even when on a Core i7 :D

If he wants to use SSE3, he can use LDDQU:

 Wikipedia wrote: [SSE3 adds] LDDQU, an alternative misaligned integer vector load that has better performance on NetBurst architectures for loads that cross cacheline boundaries. It can be helpful for video compression tasks.

 Justus wrote: Please don't be impatient if my reaction time is poor, my wife just told me we got to babysit her brother's 4 little daughters from thursday to sunday (NOOOooooo.....) Requiring some preparations, leaving me little time for my fun projects...

I'm sure they're so cute that you won't even mind. :D
01 Apr 2009, 12:01
Justus73

Joined: 28 Mar 2009
Posts: 9
 Quote: Is he using GCC?

Currently, I try to keep the code compileable using GCC and VS2005, and the latter uses the syntax __declspec(align(16)) or #pragma pack(16), which does the same.

 Quote: If he wants to use SSE3, he can use LDDQU

Thanks for the hint, I'm gonna give it a try when I have the necessary time..

 Quote: I'm sure they're so cute that you won't even mind

Well, they are very cute... but as my parent's only child I'm not used to three little, and loud, girls... strains nerves... but... =)
01 Apr 2009, 19:04
Justus73

Joined: 28 Mar 2009
Posts: 9

# Work is ongoing

Hi mites,

for those who are interested: work is going on.

How things currently look alike:

- decided to use ImageCV for image IO-purposes
- ImageCV uses 32-byte-alignment for channel data (which fits SSE code-movaps)
- I "enhance" image width so next scanline always starts on adress (A % 16 bytes == 0)
- wrote a code generator which creates 256-methods for checking for a new "best match" candidate (16 address offsets for texture x 16 address offsets for source image)
- meaning, I use two "movaps" and corresponding "pslldq" and "psrldq" for reading "16-bytes at a time, with no penalties at all"
- using "psadbw" for checking 16 pixels at a time using the luminosity channel, masking out "undesired" pixels

All of your suggestions (not using 32-bit floats but 8-bit data instead, handling 16 pixels at a time) caused my code speeding up roundabout factor 4. Using 256 specialized methods for reading and processing data caused another speedup with roundabout factor 3.

Thank you all for your suggestions! My current runtime is roundabout 1/12 of the "initial" runtime, and I'm still not done optimizing!
15 Apr 2009, 14:36

Joined: 25 Sep 2003
Posts: 2146
Location: Estonia
Did you try what is the effect between PS**DQ vs. PALIGNR. They should issue in the same time, but you never know . Maybe its just easier to use shifts algorithmically, because PALIGNR works only one way (right ). It would be curious how it affects the performance because it was designed for the specific cases you talk about.

It would be beneficial in a case like this:
 Code: ;A) PS**DQ movdqa  xmm0,[source_mem] psrldq  xmm0,constant_a ;shift data out pslldq  xmm0,constant_a ;bring back zeros movdqa  xmm1,[source_mem+16] pslldq  xmm1,constant_b ;shift to the other wall psrldq  xmm1,constant_b ;bring back zeros ;Now we have xmm0= 00000AAA and xmm1=BBBBB000 for example por     xmm0,xmm1 ; BBBBBAAA is merged ;B) PALIGNR movdqa  xmm0,[source_mem+16] ;Second 16 bytes palignr xmm0,[source_mem],constant_a/8 ; (second+first) shr bits/8 ;Now we have xmm0= BBBBBAAA for example

Just some data I gathered:
1) On 65nm PSLL/RLDQ is 2-clock instruction
2) On 45nm PSLL/RLDQ is 1-clock instruction
3) On 65nm PALIGNR is 2-clock instruction
4.a) On 45nm PALIGNR is 2- or 3-clock instruction if it acts on MMX registers
4.b) On 45nm PALIGNR is 1-clock instruction if it acts on SSE registers
So just maybe you can benefit if your program fits in this scenario

PS. Maybe you meant 'mate', not the person pictured here: http://www.camden.rutgers.edu/%7Ebwhitlow/AMULET/webpagestuff/dads_dust_mite.jpg but hey, sometimes ASM-code is so small that we resemble 'em
15 Apr 2009, 16:17
Justus73

Joined: 28 Mar 2009
Posts: 9

# Good news and some C-code voodoo

Okay, good news is: palignr is about 30% faster than the pslldq/psrldq/por-solution. Bad news is: Visual Studio compiler is heavily pissing me off - probably I'm simply too tired, but I implemented a simple C-version of the "compute 16 absolute differences per iteration" SSE-code... I wanted to compare the speed to the SSE-versions. And what I got is a more or less constant runtime duration of the C-code... means independently how often I call the C-implementation - the runtime is always the same... think I'm missing some point... sounds like a bug to find easily, but... console output tells me how often the method was called and what the calculated error is - both is correct... so... I better get some more sleep and investigate the C-code voodoo effect. But nevertheless, the SSE-code can be compared, so let's have a look:

 Code: ; Executing methods 1 time with 1.000.000 iterations each palign dif is 16000000, took 0.001828 secs pslldq dif is 16000000, took 0.002719 secs cdif   dif is 16000000, took 0.012741 secs

Well, that's okay, as it seems plausible. Now, in order to check the "jump-table jump" overhead in the SSE-variants, I raised method-call repetitions to 1000:

 Code: ; Executing methods 1.000 times with 1.000.000 iterations each palign dif is 16000000, took 1.772620 secs pslldq dif is 16000000, took 2.728955 secs cdif   dif is 16000000, took 0.012705 secs

See the problem? Calling the C-implementation 1.000 times is (in total!) slightly faster than calling it once. Okay, no problem, I thought, I'm obviously having some bug in the outer loop. Looks like the C-method is always only called once. So I inserted a printf-instruction to see how many calls the method passes. And there were 1.000 outputs the method was completely executed. And the result of each of the 1.000 executions is correct, as each byte differs by 1. And 1.000.000 iterations times 16 bytes at a time times 1 difference is 16.000.000, which was reported 1.000 times correctly.

Currently that voodoo-effect makes me sick. I already generated an assembly of the C-code and it looks okay and right, a whole bunch of movzx, 16 short jumps for correcting each difference's sign if necessary, then the "per iteration"-loop... looks okay. But... independent how often I call the C-method in the outer loop: the execution time seems to be constant. Maybe there's something wrong with my query performance counter / frequency time measurement, but I don't think so as the C-code is nearly instantly done, whereas the SSE-code really takes one, two seconds until it finishes.

Whatever that C-code voodoo-effect is, when I found the bug I'll tell comparison results between the C-code and the SSE-versions. Let's go on with the SSE-code which seem to behave "as expected", having some more or less linear CPU-consumption, fuck that voodoo until I've had some more sleep ^^

For a better imagination how the executed SSE-code looked, I copied two exemplaric loops of both SSE and the C-code solution. All assume, that we try to operate on two addresses, 'A' and 'B', whereas address of 'A' is (A MOD 16 == 1) and address of 'B' is (B MOD 16 == 2), resulting in A to be assembled from 15+1 bytes and B to be assembled from 14+2 bytes.

 Code: ; let's have shifting fun: absdif_01_02: movaps xmm0, [esi]      ; first part of A movaps xmm1, [edi]      ; first part of B movaps xmm2, [esi + 16] ; second part of A movaps xmm3, [edi + 16] ; second part of B psrldq xmm0, 1          ; A: right shift out 1 byte to be taken from second part psrldq xmm1, 2          ; B: right shift out 2 byte to be taken from second part pslldq xmm2, 15         ; A: shift desired byte to be inserted leftmost pslldq xmm3, 14         ; B: shift desired bytes to be inserted leftmost paddb xmm0, xmm2        ; merge A paddb xmm1, xmm3        ; merge B pand xmm0, xmm7         ; mask out unwanted bytes (width < 16) add esi, eax            ; increment A ptr pand xmm1, xmm7         ; mask out unwanted bytes (width < 16) add edi, edx            ; increment B ptr psadbw xmm0, xmm1       ; get abs difference paddq xmm6, xmm0        ; and add to global result dec ecx                 ; loop jne absdif_01_02

Here the generated code using your requested palign-solution:

 Code: absdifpar_01_02: movdqa xmm0, [esi+16]   ; prefill with second part of A movdqa xmm1, [edi+16]   ; prefill with second part of B palignr xmm0, [esi], 1  ; fill in first part of A -> done palignr xmm1, [edi], 2  ; fill in first part of B -> done pand xmm0, xmm7         ; A: mask out unwanted bytes (width < 16) add esi, eax            ; increment A ptr pand xmm1, xmm7         ; B: mask out unwanted bytes (width < 16) add edi, edx            ; increment B ptr psadbw xmm0, xmm1       ; get abs difference paddq xmm6, xmm0        ; add to global var dec ecx                 ; loop jne absdifpar_01_02

As you already mentioned, the palignr perfectly fits due to the "we must insert leftmost bytes shifting prefilled value right".

Well, as I already mentioned, the palignr-solution is about 30% faster... and as I just realized, I have two instructions on the same register following immediately each other - that psadbw and the paddq... another point I have to check when I've had some more sleep... but as the code-generator does the work changing all 256 methods for me, let's have a nice dream...

PS: sorry about that mItes and mAtes confusion - of course I didn't want to call you insects =)

n8, mates *g*
17 Apr 2009, 16:17

Joined: 25 Sep 2003
Posts: 2146
Location: Estonia
no need to be sorry, these confusions make people laugh. At least I got mine for that day.

Anyway, to shed some light on your compiler misbehaviour. Imagine this code:
 Code: long long sum=0; for(long long x=0;x<1000000;x++)     sum+=x; printf("%lld\n",sum);

This will print you 500000500000 as you would expect, but the code it generates will be something like this:
 Code: mov  rcx,\$fmt_str_packet_1   mov  rdx,500000500000   call printf

Hold on! Where's the loop?

The loop magically appears if you put a printf() every time you add to the sum. Because then it has to work. Compilers are very good in optimizing everything out.

Once I tried some SSE functions and my Intel compiler "optimized" them out because I didn't use their results - but I just wanted to see the code it generates

I went great lengths to get the SSE-code to appear in assembly, most of the time it just pre-calculated the value (because I gave it constants in the code ) and the code was as short as mov reg,value
17 Apr 2009, 19:48
Justus73

Joined: 28 Mar 2009
Posts: 9
Once more you were right, Madis... the compiler optimized away the main work, creating an array of absolute distances and summing up those elements... the movzx-instructions I saw only created the array...

I think it's a stupid idea to initialize an only 18-bytes wide array with arr[i] = i, and constantly pass &arr[1] and &arr[2] to the test methods, using a constant address increment of 0 bytes per loop. The compiler did not optimize away the loop, but he 'precalculated' the array of absolute distances and stored that in an 32-bit value array, from which he summed up. The mozx-instructions I mentioned initialized that array.

When I changed the test to "real-life" scenario, loading a hudge dimensioned image and let all methods operate on the luminosity channel, the compiled code and execution times reflected the real CPU costs of the C-code...

So, here's the "realistic" result of the performance comparison between C-code and SSE-codes...

 Code: cdif   dif is 5061993, took 6.199838933762 secs palign dif is 5061993, took 0.247584427027 secs pslldq dif is 5061993, took 0.369428369240 secs
18 Apr 2009, 08:43
Justus73

Joined: 28 Mar 2009
Posts: 9
Just added a second variant of the palignr-code... calculating two rows at a time, using lea reg, [reg + 2*step] to save two adds... and to unroll the loop a bit further... improved speed by another 13%... then improved runtime again by creating two variants of functions, one for "we wanna sum up less than 16 bytes per register" (requiring the pands for masking out undesired bytes) and the "we wanna sum up all 16-bytes in the register", without pands for masking anything out. Performance results:

 Code: cdif        dif is 5061436, took 6.202333040270 secs palign [2u] dif is 5061436, took 0.214480939358 secs ; not masking, two rows per iteration palign [2m] dif is 5061436, took 0.246009950970 secs ; masking, two rows per iteration palign [1]  dif is 5061436, took 0.248955060666 secs ; masking, single row per iteration pslldq      dif is 5061436, took 0.364519927047 secs ; masking, single row per iteration

The schematic code for addresses srcA and srcB with "address MOD 16 != 0" currently looks like this:

 Code: _absDifSSE3_01_01_twoLines: movdqa xmm0, [esi+16]             ; prefill   A / line 0 movdqa xmm1, [edi+16]             ; prefill   B / line 0 movdqa xmm2, [esi+eax+16]         ; prefill   A / line 1 movdqa xmm3, [edi+edx+16]         ; prefill   B / line 1 palignr xmm0, [esi], offsetA      ; complete  A / line 0 [offsetA = numerical constant = srcA MOD 16]  palignr xmm1, [edi], offsetB      ; complete  B / line 0 [offsetB = numerical constant = srcB MOD 16] palignr xmm2, [esi+eax], offsetA  ; complete  A / line 0 [offsetA = numerical constant = srcA MOD 16] palignr xmm3, [edi+edx], offsetB  ; complete  B / line 0 [offsetB = numerical constant = srcB MOD 16] pand xmm0, xmm7                   ; mask out undesired bytes [optional] pand xmm2, xmm7                   ; mask out undesired bytes [optional] pand xmm1, xmm7                   ; mask out undesired bytes [optional] pand xmm3, xmm7                   ; mask out undesired bytes [optional] lea esi, [esi + eax*2]            ; increment srcA by stepA * 2 lea edi, [edi + edx*2]            ; increment srcB by stepB * 2 psadbw xmm0, xmm1                 ; get abs dif line 0 psadbw xmm2, xmm3                 ; get abs dif line 1 paddq xmm6, xmm0                  ; add abs dif paddq xmm6, xmm2                  ; add abs dif dec ecx jne _absDifSSE3_01_01_twoLines

Does someone has any further optimization ideas? I must confess: I ran out of ideas.

Last edited by Justus73 on 18 Apr 2009, 13:11; edited 1 time in total
18 Apr 2009, 12:57
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 15324
Location: Bigweld Industries
Things you many want to try:
• Move just one LEA in between the two PADDQs
• Move both the two LEAs in between the two PADDQs.
• Replace DEC ECX with SUB ECX,1
• Reorder the PANDs to process registers in the order 0,1,2,3
18 Apr 2009, 13:06

Joined: 25 Sep 2003
Posts: 2146
Location: Estonia
Just to add to revolutions's list:

dec ecx / sub ecx,1 can be moved up

Some compilers move the counter all the way up, because nothing will change the flags when only SSE instructions are used. My other concern is that on newer CPUs, micro-/macro-op fusion is possible so cmp+Jcc might be calculated and issued in the same clock so no gain there.

_________________
My updated idol http://www.agner.org/optimize/
18 Apr 2009, 21:16
Justus73

Joined: 28 Mar 2009
Posts: 9
Thanks... tried the suggestions - "sub ecx, 1" instead of "dec ecx" and moving the sub some lines upwards brought around another percent... hard to measure, as I seem to reach the limit where "background activity noise" makes it hard to measure runtime differences... just came across agner's page http://www.agner.org, which offers some tool snippet in order to measure more precise...

Used VTune about ten years ago - that was nice... doesn't seem as there is some similar freeware/open source tool... or?
19 Apr 2009, 11:24
Sean4CC

Joined: 15 Apr 2015
Posts: 14
Agner Fog is one of the rudest guys you are ever likely to encounter. I speak from personal experience. He is from the same mold as Linus Torvalds.
17 Apr 2015, 13:27
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------Blog General----------------MainDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsCompiler InternalsIDE DevelopmentOS ConstructionNon-x86 architecturesHigh Level LanguagesProgramming Language DesignProjects and IdeasExamples and Tutorials Other----------------FeedbackHeapTest Area

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum