flat assembler
Message board for the users of flat assembler.

Index > Projects and Ideas > Efficient float-based SSE code 4 sum of squared differences

Justus73



Joined: 28 Mar 2009
Posts: 9
Justus73 28 Mar 2009, 10:02
Hi board,


As a brand-new member of this board I would first like to thank you all. Before registering (today) I had been reading your posts, ideas and projects for some years, and I enjoyed them a lot.

Currently I am rewriting an older project of mine (C++ with inline assembler) which makes excessive use of a simple two-dimensional matrix error calculation, which is nothing but e = sum((a[x,y] - b[i,j])²) for a rectangular area in a and b. That calculation is the key time consumer in the whole code. As I haven't written (SSE) assembly for quite a while, I think my implementation is far from optimal or efficient, so I would kindly like to request your suggestions for optimizations.

A short C implementation of what I am generally doing:
Code:
float getErrorC(
    int channelIndex, 
    int srcX, 
    int srcY, 
    int txtX, 
    int txtY, 
    int width, 
    int height, 
    float maxError) {

    // channel elements are floats
    IplImage *pSrcChannel = this->pSource->getChannel(channelIndex);
    IplImage *pTxtChannel = this->pTexture->getChannel(channelIndex);

    float error = 0.0;

    int x, y;
    float *pSrcData;
    float *pTxtData;
    float dif;

    for (y=0; y<height; y++) {
      // byte offsets are applied before the cast to float*,
      // matching the SSE version below
      pSrcData = (float *) (pSrcChannel->imageData + 
          ((srcY + y) * pSrcChannel->widthStep) +
          (srcX * sizeof(float)));

      pTxtData = (float *) (pTxtChannel->imageData + 
          ((txtY + y) * pTxtChannel->widthStep) +
          (txtX * sizeof(float)));
      
      for (x=0; x<width; x++) {
        dif = *pSrcData - *pTxtData;
        pSrcData++;
        pTxtData++;
        error += (dif * dif);
      }

      if (error > maxError) {
        return -1.0f;
      }
    }

    return error;
}
    


And this is my old SSE implementation that I want to rewrite in FASM; please excuse me for showing it as inline code (focus on the __asm block):
Code:
float getError(
    int channelIndex, 
    int srcX, 
    int srcY, 
    int txtX, 
    int txtY, 
    int width, 
    int height, 
    float maxError) {

  // return value if calculation was aborted 'cause error got > maxError
  const float ERROR_TOO_BIG = -1.0f;

  IplImage *pSrcChannel = this->pSource->getChannel(channelIndex);
  IplImage *pTxtChannel = this->pTexture->getChannel(channelIndex);

  if (width < 8) {
    // too small, let the C-loop do that
    return this->getErrorC(channelIndex, srcX, srcY, txtX, txtY, width, height, maxError);
  }

  float *pSrcData;
  float *pTxtData;

  pSrcData = (float *) (pSrcChannel->imageData + 
      (srcY * pSrcChannel->widthStep) +
      (srcX * sizeof(float)));

  pTxtData = (float *) (pTxtChannel->imageData + 
      (txtY * pTxtChannel->widthStep) +
      (txtX * sizeof(float)));

  float result[4];

  // SSE-main loop renders max. of 8 pixels per iteration
  const unsigned int PELS_PER_LOOP = 8;
  // calculate amount of iterations:
  const unsigned int runs = ((unsigned int)width) / PELS_PER_LOOP;

  // calculate pitches from the end of the matrix areas to the next row's first element
  unsigned int srcPitch = pSrcChannel->widthStep - (width * sizeof(float));
  unsigned int txtPitch = pTxtChannel->widthStep - (width * sizeof(float));

  float current;

  // create a mask to mask out elements whose
  // error otherwise would be added twice
  // (as we always process at least 4 elements at a time)
  unsigned int remainderMask[4];

  int remainder = width & 3;

  for (int i=0; i<4; i++) {
    if (i < remainder) {
      remainderMask[3-i] = 0xffffffff;
    } else {
      // mask out unwanted elements
      remainderMask[3-i] = 0;
    }
  }

  __asm {
    mov edx, height            
    mov ecx, runs               
    mov eax, ecx                

    pxor xmm6, xmm6
    pxor xmm7, xmm7

    mov esi, pSrcData
    mov edi , pTxtData

  loop8:
    // todo: prefetch next line
    movups xmm0, [esi + 16*0]   // read src[0..3]
    movups xmm1, [edi + 16*0]   // read txt[0..3]

    movups xmm2, [esi + 16*1]   // read src[4..7]
    movups xmm3, [edi + 16*1]   // read txt[4..7]

    subps xmm0, xmm1            // src[0..3] -= txt[0..3]
    subps xmm2, xmm3            // src[4..7] -= txt[4..7]
    
    add esi, 16*2               // src += 8 * sizeof(float)
    add edi, 16*2               // txt += 8 * sizeof(float)

    mulps xmm0, xmm0            // square error in xmm0
    mulps xmm2, xmm2            // square error in xmm2

    addps xmm6, xmm0            // add error
    addps xmm7, xmm2            // add error

    dec ecx                     // next iteration "8-elements loop"
    jne loop8

    mov ecx, [width]            // calculate remaining elements 
    and ecx, 7                  // (=width & 7)
    cmp ecx, 4                  // check if we need a 4-pixel per run loop:
    jl  noLoop4

    // we need to process a 4-pel remainder:
    movups xmm0, [esi + 16*0]   // read src[0..3]
    movups xmm1, [edi + 16*0]   // read txt[0..3]

    subps xmm0, xmm1            // src[0..3] -= txt[0..3]

    add esi, 16*1               // src += 4 * sizeof(float)
    add edi, 16*1               // txt += 4 * sizeof(float)

    mulps xmm0, xmm0            // square error in xmm0
    addps xmm6, xmm0            // add error
    
  noLoop4:                      // jump taken means fewer than 4 elements remained

    and ecx, 3                  // check if we need to have a "1 to 3" iteration
    je noLoop1to3

    push ecx                    // store remainder count for scanline increment

                                // because reading always 4 elements at a time,
                                // I need to prevent reading more elements than
                                // in the matrix, causing a page fault in worst
                                // case when matrix address is at end of mem page
                                
                                // therefore, decrement the matrix element position
                                // by (4 - remainder), read 4 elements, and
                                // mask out the already calculated elements,
                                // without ever leaving valid address space
    
    neg ecx                     // -remainder
    add ecx, 4                  // += 4
    shl ecx, 2                  // *= 4 = byte distance to step back, staying inside the calculation area
    sub esi, ecx                
    sub edi, ecx

    // process the remaining 1 to 3 pels (a full 4 are read, then masked):
    movups xmm0, [esi + 16*0]   // read src[0..3]
    movups xmm1, [edi + 16*0]   // read txt[0..3]

    add esi, ecx
    add edi, ecx

    subps xmm0, xmm1            // src[0..3] -= txt[0..3]

    movups xmm1, remainderMask  // mask out unwanted elements
    pand xmm0, xmm1

    pop ecx                     // reload remainder for scanline increment

    shl ecx, 2
    add esi, ecx                // src += remainder * sizeof(float)
    add edi, ecx                // txt += remainder * sizeof(float)

    mulps xmm0, xmm0            // square error in xmm0
    addps xmm6, xmm0            // add error

  noLoop1to3:
    add esi, srcPitch           // increment src to next scanline start
    add edi, txtPitch           // increment txt to next scanline start

    addps xmm7, xmm6            // add both error registers
    pxor xmm6, xmm6             // clear xmm6

    movaps xmm4, xmm7           // now ensure current error is below maxError, 
    movaps xmm5, xmm7           // therefore add all 4 floats in xmm7:
    psrldq xmm4, 4            
    addps xmm5, xmm4

    movaps xmm4, xmm5
    psrldq xmm4, 8
    addps xmm5, xmm4

    movss xmm4, maxError
    subps xmm4, xmm5
    movss [current], xmm4
    mov ecx, [current]

    and ecx, 0x80000000
    jne maxErrorExceeded

    mov ecx, eax                // re-set horz. run-counter

    dec edx
    jne loop8

    movups [result], xmm7
    jmp end
  maxErrorExceeded:
    xor ecx, ecx
    lea esi, result
    mov [esi + 0*4], ecx
    mov [esi + 1*4], ecx
    mov [esi + 2*4], ecx
    mov ecx, [ERROR_TOO_BIG]
    mov [esi + 3*4], ecx
  end:
  }

  float resultSum = result[0] + result[1] + result[2] + result[3];

  return resultSum;
}
    


On my Q6600 the performance drops dramatically if I create more specialized code (e.g. a block for each remaining-elements case from one to three elements). I suppose the code then no longer fits into the instruction cache. So what I need is a good compromise between code size and performance (that's why I only add up 8 elements in the main loop, as I thought that would be a good idea).

What do you think?
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 30 Mar 2009, 13:44
1) replace movups with movaps (ICC will align pointers to m128 automatically, but I don't know how other compilers behave... always use caution)
2) I could really use some sample data and the size of it to make the optimizations. Is it in MB?
3) I would really like the code to be clearer so I can optimize more Very Happy I don't know what it does or what its goals are...

Code:
    movaps xmm0, [esi + 00]     // read src[0..3]
    movaps xmm2, [esi + 16]     // read src[4..7]

    subps xmm0, [edi + 00]      // you don't need to cache the destination
    subps xmm2, [edi + 16]      // just subtract
    
    add esi, 8*4                // make the assembly resemble the C code:
    add edi, 8*4                // ptr += 8 * sizeof(float), with sizeof(float) == 4

[---]

    // we need to process 4 pels remainers:
    movaps xmm0, [esi + 00]     // read src[0..3]
    subps xmm0, [edi + 00]      // read txt[0..3]

//;There are more movUps down the code...
    


This should make your code up to twice as fast, because even if the data is aligned, using unaligned loads on older CPUs (Q6600) hits you with a penalty.

It's no excuse to use movups even on a Core i7 Very Happy
bitRAKE



Joined: 21 Jul 2003
Posts: 4061
Location: vpcmpistri
bitRAKE 30 Mar 2009, 17:13
The code cache is 32k - this code is less than 1k.

What is the likelihood of MaxError being exceeded?
Justus73



Joined: 28 Mar 2009
Posts: 9
Justus73 30 Mar 2009, 20:49
Okay, let me explain further:

The code is part of a texture transfer tool. It is based upon Efros & Freeman's SIGGRAPH 2001 paper "Image Quilting for Texture Synthesis and Transfer" (from University of California Berkeley and Mitsubishi Electric Research Laboratories).

Explained in short:
Imagine we have two images, a "source image" and a "texture image". Furthermore, imagine we choose a small "transfer" area, let's say of 32x32 pixels.

The general idea behind the "texture transfer" is to move the "transfer" area over the source image in steps (normally from top left to bottom right), always looking for the "best matching area" from the "texture image", so that it "looks like" the transfer area in the source image. The thing that makes it work is that the transfer areas of two consecutive iterations "overlap" each other, for example by one third. In order not to merely create a similar-looking puzzle from the "source image" out of puzzle tiles from the "texture image", the overlapping areas are "painted" in a special way. Consider the vertical overlapping area of two "transfer areas", which would be of size 10x32 pixels (for example): we then look for the path from y=0 to y=31 where the error between the texture and the current state is at its minimum. Think of it as if each "step" downwards (which can be down-left, down, or down-right) has a "cost" equal to (source[x, y] - texture[i, j])². The "right path" is then the path where the sum of those step costs is minimal.
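
To make that concrete, here is a minimal C sketch of that minimum-cost path search - not the tool's actual code - assuming the squared differences of the overlap region have already been stored row-major in a cost[] array and that only the total cost of the cheapest path is needed (a real implementation would also backtrack to recover the path itself):

Code:
#include <float.h>

/* cost[y*width + x] = (source[x,y] - texture[i,j])^2 for the overlap region */
float minCutCost(const float *cost, int width, int height) {
    float dp[2][128];                     /* assumes overlap width <= 128 */
    int x, y, cur = 0;

    for (x = 0; x < width; x++)           /* cost of entering at row 0 */
        dp[cur][x] = cost[x];

    for (y = 1; y < height; y++) {
        int prev = cur;
        cur ^= 1;
        for (x = 0; x < width; x++) {     /* step down-left, down or down-right */
            float best = dp[prev][x];
            if (x > 0         && dp[prev][x-1] < best) best = dp[prev][x-1];
            if (x < width - 1 && dp[prev][x+1] < best) best = dp[prev][x+1];
            dp[cur][x] = cost[y*width + x] + best;
        }
    }

    float best = FLT_MAX;                 /* cheapest exit point in the last row */
    for (x = 0; x < width; x++)
        if (dp[cur][x] < best) best = dp[cur][x];
    return best;
}
    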

That path is then our "cut line" to cut along, so we don't copy a rectangular block but cut along those paths in order to minimize optical errors between the current image state and the texture. This is done for the left and top edge of each block. Furthermore, the image can be "transferred" in multiple iterations, reducing the transfer area's size from iteration to iteration. There is also a mechanism which allows better preservation of the texture characteristics that have already been transferred (see the alpha formula on page 5 of the PDF), but that is not that relevant for the calculation.

That is, in short, what my tool does. I hope I explained it more or less understandably, and I hope the PDF helps you imagine what it's all about - otherwise, please ask.

The "key CPU-time consumer" is calculating the error between a source image block (NxM pixels) and a texture image block (also NxM pixels). There are many ways to "choose" a block from the texture image, for example by simply randomizing some (valid) X/Y positions in the texture image, calculating the errors between the source area and all (randomized) texture areas, and using the one with the smallest error. Or one can iterate over all possible areas in the texture image - whatever one wants.
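
As an illustration only (this is not the tool's code, and the callback signature is simplified - no channel index), such a randomized search could look roughly like this, with the current best error passed in as maxError so hopeless candidates are cancelled early:

Code:
#include <stdlib.h>
#include <float.h>

/* simplified error callback: returns the SSD, or -1.0f if it exceeded maxError */
typedef float (*ErrorFn)(int srcX, int srcY, int txtX, int txtY,
                         int width, int height, float maxError);

void findBestMatch(ErrorFn getError, int texW, int texH,
                   int srcX, int srcY, int width, int height,
                   int tries, int *bestTx, int *bestTy) {
    float bestError = FLT_MAX;                    /* no bound yet */
    for (int i = 0; i < tries; i++) {
        int tx = rand() % (texW - width  + 1);    /* random valid position */
        int ty = rand() % (texH - height + 1);
        float e = getError(srcX, srcY, tx, ty, width, height, bestError);
        if (e >= 0.0f && e < bestError) {         /* -1.0f means "aborted early" */
            bestError = e;
            *bestTx = tx;
            *bestTy = ty;
        }
    }
}
    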

The code I posted is my "calculate the error between the luminosity channel of the source image (starting at position sx, sy) and the luminosity channel of the texture image (starting at tx, ty)" routine.

"Normally", the transfer area has a size between 4 and 128 pixels, depending on the source image to be transferred, resulting in 4x4 up to 128x128 floating-point error calculations per attempt to locate a good match in the texture. To give you some numbers: you can assume you find an "acceptable looking" match with at least(!) 1,000 tries per transfer area, and a "good looking" one with at least 10,000 tries per transfer area, assuming a transfer area of 32x32 pixels and a texture of 512x384 pixels. As you can guess, the larger the texture, the more tries one needs to find a good match "by random". To stay with the example of the 512x384 pixel "texture image" and 32x32 pixel transfer area: finding the best match would require (512-32+1) x (384-32+1) tries, i.e. 169,793 tries. But in fact one already gets good results checking a tenth of the possible blocks.

@Madis731:
The idea of using movaps instead of movups would require me to store both the texture and the source image four times: once with elements 0, 4, 8, 12... on a 16-byte boundary, then with elements 1, 5, 9, 13... on a 16-byte boundary, and so on. Assuming source images of up to 4096x3072 pixels, the memory requirement for a single instance is already

Code:
(4096 x 3072 pixels) x (4 channels Lum, R, G, B) x (4 bytes per float) = 201,326,592 bytes = 192 MiB.    
(hope that answers your second question)

I had not yet thought about holding the image channel data four times, but I think it may be worth a try - "twice as fast" sounds promising =)
Sample data is no problem, but in fact you can use the pixel values of any two grayscale images and convert them to the interval [0.0f, 1.0f] by applying 1/255 as a factor. If you want, I can create a source code fragment with a whole bunch of "df"s, includable via the "file" directive, forming the texture and source image as constant data definitions...

To make things a bit clearer: think of the two images of size
Code:
textureWidth x textureHeight    
and
Code:
sourceWidth x sourceHeight    
as arrays of floats. The code does nothing more than

Code:
error = 0
// let (sx, sy) specify the top left point in source image
// let (tx, ty) specify the top left point in texture image
// whereas the width and height of both areas are equal
for (y = 0 upto transferAreaHeight-1) {
  for (x = 0 upto transferAreaWidth-1) {
    luminosityDifference = sourceImage[sx+x, sy+y] - textureImage[tx+x, ty+y]
    error = error + (luminosityDifference * luminosityDifference)
  }

  if (error >= maxError) {
    // cancel, cannot be a new "best match"
    return -1
  }
}
// looks like a new best match candidate:
return error
    


The code says "looks like a new best match candidate" in the last row of the pseudo-code because, thanks to the read-only access to the images, the code can be executed multithreaded, allowing N threads to look up "their" best match and choosing the best of the N results in a last step. The advantage of synchronizing the best match's error value across the threads is very low, so it's not worth sharing a new best match's error with all threads - each one has its own private "best match error value = maxError", so to say.
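
Purely as a sketch of that design - private best-error per worker, one merge at the end - here it is expressed with C++11's std::thread for brevity (which is of course not what the 2005-era code would use); searchSlice is a placeholder for the per-thread candidate scan:

Code:
#include <thread>
#include <vector>
#include <functional>

struct Match { float error; int tx, ty; };

// searchSlice is the per-thread worker: it scans its own candidate range
// with its own private best-error bound (used as maxError) and never
// synchronizes with the other workers while running.
Match searchParallel(const std::function<Match(int, int)> &searchSlice,
                     int totalTries, int numThreads) {
    std::vector<Match>       results(numThreads);
    std::vector<std::thread> workers;
    int per = totalTries / numThreads;

    for (int t = 0; t < numThreads; t++)
        workers.emplace_back([&, t] {
            results[t] = searchSlice(t * per, (t + 1) * per);
        });
    for (std::thread &w : workers)
        w.join();

    Match best = results[0];                 // merge the N private results once
    for (int t = 1; t < numThreads; t++)
        if (results[t].error < best.error)
            best = results[t];
    return best;
}
    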

@bitRAKE
As explained in the pseudo-code above: the code is executed to check whether an area is a new "best match" candidate. That means that if we already know the error of the current best match, we can cancel further calculation as soon as we exceed that bound, because the candidate can no longer become the new best match. That's the reason for checking from time to time whether maxError has been exceeded.

A little statistic about the maxError-cancellation mechanism: 85% of all error calculations are cancelled before 75% of the calculation is finished. That, of course, depends on when the best candidate is hit: the sooner one hits the best candidate, the more calculations are cancelled early.

A 32k instruction cache? Per core, as I see? Fuck, I have definitely missed some CPU generations... and I thought I would have to split the loop up into smaller pieces in any case... but 32k should be enough to hold, let's say, 5 specialized calculation loops: an 8-elements-per-loop body executed "width / 8" times, and then, for the remaining elements, a 4-elements, 3-elements, 2-elements and single-element loop, so I cover every possible case of "width MOD 8".

At the moment I am trying to understand the data prefetching instructions; since I know which scanline part will be calculated next, I hope to save some further cycles, but I have not yet checked what gain I can expect from that.

Please don't be impatient if my reaction time is poor; my wife just told me we have to babysit her brother's 4 little daughters from Thursday to Sunday (NOOOooooo.....)
That requires some preparation, leaving me little time for my fun projects...

Some final words: tonight, when I could not get to sleep due to too much coffee (drunk because I was too tired from too little sleep - I know... I know...), I thought about the idea of "slicing" the area up into columns and doing a top-to-bottom calculation... for example, calculating 15x15 pixels in the following way:

1) calculate 8 elements/loop from top to bottom
2) increment start addresses by 8 elements
3) calculate 4 elements/loop from top to bottom
4) increment start addresses by 4 elements
5) calculate 3 elements/loop from top to bottom

The advantage would be that I can write multiple code variants for different block widths and chain them together so they calculate "stripewise"... the code size of the loops would be small, and branch hinting (does that really save time?) would be easy to apply... but I don't know how much of the iteration cost (3x the loop iteration overhead instead of 1x) would eat up the performance gained by distributing the instructions better across different registers.
bitRAKE



Joined: 21 Jul 2003
Posts: 4061
Location: vpcmpistri
bitRAKE 31 Mar 2009, 06:05
Interesting paper.

Are you really doing this on HDR data? Maybe it's better to operate on the integer data - there is a sum of absolute differences instruction. Very Happy
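
For reference, a minimal sketch of what that instruction (PSADBW) computes, written here with the SSE2 intrinsic rather than assembly (not code from this thread): it sums |a[i] - b[i]| over 16 bytes, producing one partial sum per 64-bit half.

Code:
#include <emmintrin.h>
#include <stdint.h>

/* sum of absolute differences over one 16-byte block (unaligned loads) */
static inline uint32_t sad16(const uint8_t *a, const uint8_t *b) {
    __m128i va  = _mm_loadu_si128((const __m128i *)a);
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);
    __m128i sad = _mm_sad_epu8(va, vb);          /* PSADBW: two 64-bit partial sums */
    return (uint32_t)_mm_cvtsi128_si32(sad) +
           (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sad, 8));
}
    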
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 31 Mar 2009, 07:47
You can still use aligned loads when the offsets aren't aligned. You just need alignment code on both ends.

Code:
Quoting from Intel manual 248966.pdf (Optimization Reference)
Throughput (cycles):
                        Core i7         45nm Core2      65nm Core2
Alignment Scenario      06_1AH          06_17H          06_0FH
16B aligned               1               2               2
Not-16B aligned,
not cache split           1              ~2              ~2
Split $-line boundary    ~4.5           ~20             ~20
    


What I propose is that if an unaligned load hits you with a 20-cycle penalty in the L1 cache, then you have about 19 cycles to align your pointers before the loop starts.
By the time those 20 cycles would have passed, you might already have loaded the 16 bytes another way:
Code:
  ;movups xmm0, [esi + 00]     // This would cost us about 20cycles penalty
  mov     eax,esi
  and     esi,not 0Fh
  movaps  xmm0,[esi + 00]
  sub     eax,esi
  jz      .aligned
  movaps  xmm1,[esi + 16]
  cmp     eax,4
  jne     @f
  palignr xmm1,xmm0,4
@@:
  cmp     eax,8
  jne     @f
  palignr xmm1,xmm0,8
@@:
  cmp     eax,12
  jne     @f
  palignr xmm1,xmm0,12
@@:;                            //This alternative would cost roughly 10cycles
  ;Here xmm1 has the first 16 bytes
[---]
.aligned:
  ;xmm0 has the first 16 bytes
[---]
    

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Justus73



Joined: 28 Mar 2009
Posts: 9
Justus73 31 Mar 2009, 10:22
First of all, thank you for your suggestions - those two pieces of information give me a lot to try and optimize; I guess the combination of both suggestions is going to rock... I'm going to try them as soon as I can... Very Happy

@Madis731
Quote:
You can still use aligned loads when the offsets aren't aligned. You just need alignment code on both ends.


Damn - you're so right... and I have even used that before - during my time in the gaming industry I wrote a heavily optimized "isometric tile" painting routine which used specially implemented copy instructions for the leading and trailing misaligned parts... one for each possible combination of misaligned source and destination address (scanline-wise copying, using jump tables to look up the corresponding copy method)... that was on the Pentium II... god, I'm getting old... I had really forgotten that elementary knowledge... óO

@bitRAKE
Quote:
there is a sum of absolute difference instruction


Thanks - I completely missed that one. As far as I just saw, the operation works on 8 or 16 bytes at a time. Hmmm... that would allow a lot of things... for example, using the SAD (sum of absolute differences) for quickly identifying candidates, and then using the more precise SSD (sum of squared differences) only on those candidates. I wouldn't lose much by switching back to byte data, except that I need to ensure that error calculations on blocks whose theoretical maximum error exceeds 255x256 (because width x height x 255 > 255x256) are subdivided into smaller partial calculations, but that's easy...
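
A tiny sketch of that two-stage idea (hypothetical helper names, not code from this thread): run the cheap SAD over all candidate positions first and keep only the ones below some tunable threshold, then spend the exact squared-difference sum on the survivors.

Code:
typedef struct { int tx, ty; } Candidate;

/* sadBlock/ssdBlock stand for assumed block-wise versions of the routines
   discussed in this thread; sadThreshold is an assumed, tunable cut-off */
int pickBestBySadThenSsd(const Candidate *cand, int count,
                         unsigned (*sadBlock)(int tx, int ty),
                         float    (*ssdBlock)(int tx, int ty, float maxError),
                         unsigned sadThreshold) {
    int   bestIdx   = -1;
    float bestError = 3.4e38f;
    for (int i = 0; i < count; i++) {
        if (sadBlock(cand[i].tx, cand[i].ty) > sadThreshold)
            continue;                              /* cheap reject */
        float e = ssdBlock(cand[i].tx, cand[i].ty, bestError);
        if (e >= 0.0f && e < bestError) {          /* -1 = aborted, as before */
            bestError = e;
            bestIdx   = i;
        }
    }
    return bestIdx;                                /* index of best candidate, or -1 */
}
    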
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 01 Apr 2009, 12:01
Madis731 wrote:
1) replace movups with movaps (ICC will align pointers to m128 automatically, but I don't know how other compilers behave... always use caution)
...
This should make your code up to twice as fast because even if data is aligned and you used unaligned loads on older CPUs (Q6600), then you hit a penalty.


Is he using GCC?

gcc.info wrote:

`aligned (ALIGNMENT)'
This attribute specifies a minimum alignment for the variable or
structure field, measured in bytes. For example, the declaration:

int x __attribute__ ((aligned (16))) = 0;

causes the compiler to allocate the global variable `x' on a
16-byte boundary. On a 68040, this could be used in conjunction
with an `asm' expression to access the `move16' instruction which
requires 16-byte aligned operands.

You can also specify the alignment of structure fields. For
example, to create a double-word aligned `int' pair, you could
write:

struct foo { int x[2] __attribute__ ((aligned (8))); };

This is an alternative to creating a union with a `double' member
that forces the union to be double-word aligned.


Madis731 wrote:

It's no excuse to use movups even on a Core i7 :D


If he wants to use SSE3, he can use LDDQU:

Wikipedia wrote:

[SSE3 adds] LDDQU, an alternative misaligned integer vector load that
has better performance on NetBurst architectures for loads that cross
cacheline boundaries. It can be helpful for video compression tasks.


Justus wrote:

Please don't be impatient if my reaction time is poor, my wife just told me we got to babysit her brother's 4 little daughters from thursday to sunday (NOOOooooo.....)
Requiring some preparations, leaving me little time for my fun projects...


I'm sure they're so cute that you won't even mind. :D
Justus73



Joined: 28 Mar 2009
Posts: 9
Justus73 01 Apr 2009, 19:04
Quote:
Is he using GCC?

Currently I try to keep the code compilable with both GCC and VS2005; the latter uses the syntax __declspec(align(16)) for this.
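
For illustration, one way to keep a 16-byte aligned declaration compilable under both compilers is a small macro like this (a sketch, not the project's actual code):

Code:
#if defined(_MSC_VER)
  #define ALIGN16 __declspec(align(16))           /* VS2005 */
#else
  #define ALIGN16 __attribute__((aligned(16)))    /* GCC */
#endif

ALIGN16 static float rowBuffer[1024];             /* 16-byte aligned, safe for movaps */
    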

Quote:
If he wants to use SSE3, he can use LDDQU

Thanks for the hint, I'm gonna give it a try when I have the necessary time..

Quote:
I'm sure they're so cute that you won't even mind

Well, they are very cute... but as my parents' only child I'm not used to three little, and loud, girls... it strains the nerves... but... =)
Justus73



Joined: 28 Mar 2009
Posts: 9
Justus73 15 Apr 2009, 14:36
Hi mites,

for those who are interested: work is going on.

How things currently look:

- decided to use OpenCV for image IO purposes
- OpenCV uses 32-byte alignment for channel data (which suits the SSE movaps code)
- I "pad" the image width so the next scanline always starts at an address A with (A % 16 == 0)
- wrote a code generator which creates 256 methods for checking for a new "best match" candidate (16 address offsets for the texture x 16 address offsets for the source image - see the dispatch sketch after this list)
- meaning, I use two "movaps" and corresponding "pslldq"/"psrldq" shifts for reading 16 bytes at a time with no penalties at all
- using "psadbw" to check 16 pixels at a time on the luminosity channel, masking out "undesired" pixels
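
The dispatch sketch mentioned above - hypothetical names, not the generated code itself - boils down to a 16x16 function-pointer table indexed by the two misalignment offsets:

Code:
#include <stdint.h>

typedef uint64_t (*AbsDifFn)(const uint8_t *src, const uint8_t *txt,
                             int width, int height);

/* filled once by the code generator: one specialized routine per
   (src MOD 16, txt MOD 16) combination, 256 entries in total */
static AbsDifFn absDifTable[16][16];

static uint64_t absDif(const uint8_t *src, const uint8_t *txt,
                       int width, int height) {
    unsigned srcOff = (unsigned)((uintptr_t)src & 15);   /* src address MOD 16 */
    unsigned txtOff = (unsigned)((uintptr_t)txt & 15);   /* txt address MOD 16 */
    return absDifTable[srcOff][txtOff](src, txt, width, height);
}
    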

All of your suggestions (using 8-bit data instead of 32-bit floats, handling 16 pixels at a time) sped my code up by roughly a factor of 4. Using 256 specialized methods for reading and processing the data gave another speedup of roughly a factor of 3.

Thank you all for your suggestions! My current runtime is roughly 1/12 of the initial runtime, and I'm still not done optimizing!
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 15 Apr 2009, 16:17
Did you try what the difference is between PS**DQ and PALIGNR? They should issue in the same time, but you never know Smile. Maybe it's just easier to use shifts algorithmically, because PALIGNR works only one way (right Smile ). It would be interesting to see how it affects performance, because it was designed for exactly the cases you are talking about.

It would be beneficial in a case like this:
Code:
;A) PS**DQ
movdqa  xmm0,[source_mem]
psrldq  xmm0,constant_a ;shift data out
pslldq  xmm0,constant_a ;bring back zeros
movdqa  xmm1,[source_mem+16]
pslldq  xmm1,constant_b ;shift to the other wall
psrldq  xmm1,constant_b ;bring back zeros
;Now we have xmm0= 00000AAA and xmm1=BBBBB000 for example
por     xmm0,xmm1 ; BBBBBAAA is merged

;B) PALIGNR
movdqa  xmm0,[source_mem+16] ;Second 16 bytes
palignr xmm0,[source_mem],constant_a/8 ; (second+first) shr bits/8
;Now we have xmm0= BBBBBAAA for example
    

Just some data I gathered:
1) On 65nm PSLL/RLDQ is 2-clock instruction
2) On 45nm PSLL/RLDQ is 1-clock instruction
3) On 65nm PALIGNR is 2-clock instruction
4.a) On 45nm PALIGNR is 2- or 3-clock instruction if it acts on MMX registers
4.b) On 45nm PALIGNR is 1-clock instruction if it acts on SSE registers
So just maybe you can benefit if your program fits in this scenario Smile

PS. Maybe you meant 'mate', not the person pictured here: http://www.camden.rutgers.edu/%7Ebwhitlow/AMULET/webpagestuff/dads_dust_mite.jpg Very Happy but hey, sometimes ASM-code is so small that we resemble 'em Razz
Justus73



Joined: 28 Mar 2009
Posts: 9
Justus73 17 Apr 2009, 16:17
Okay, the good news is: palignr is about 30% faster than the pslldq/psrldq/por solution. The bad news is: the Visual Studio compiler is heavily pissing me off - probably I'm simply too tired, but I implemented a simple C version of the "compute 16 absolute differences per iteration" SSE code because I wanted to compare its speed to the SSE versions. And what I got is a more or less constant runtime for the C code - meaning that no matter how often I call the C implementation, the runtime is always the same... I think I'm missing some point... it sounds like a bug that should be easy to find, but... the console output tells me how often the method was called and what the calculated error is - both are correct... so... I'd better get some more sleep and then investigate the C-code voodoo effect. Nevertheless, the SSE code can be compared, so let's have a look:

Code:
; Executing methods 1 time with 1,000,000 iterations each
palign dif is 16000000, took 0.001828 secs
pslldq dif is 16000000, took 0.002719 secs
cdif   dif is 16000000, took 0.012741 secs
    


Well, that's okay, as it seems plausible. Now, in order to check the "jump-table jump" overhead in the SSE variants, I raised the number of method-call repetitions to 1,000:

Code:
; Executing methods 1,000 times with 1,000,000 iterations each
palign dif is 16000000, took 1.772620 secs
pslldq dif is 16000000, took 2.728955 secs
cdif   dif is 16000000, took 0.012705 secs
    


See the problem? Calling the C implementation 1,000 times is (in total!) even slightly faster than calling it once. Okay, no problem, I thought, I obviously have some bug in the outer loop; it looks as if the C method is only ever called once. So I inserted a printf to see how many calls the method receives - and there were 1,000 outputs saying the method was completely executed. And the result of each of the 1,000 executions is correct, as each byte differs by 1, and 1,000,000 iterations times 16 bytes at a time times a difference of 1 is 16,000,000, which was correctly reported 1,000 times.

Currently that voodoo effect makes me sick. I already generated an assembly listing of the C code and it looks right: a whole bunch of movzx, 16 short jumps for correcting each difference's sign if necessary, then the per-iteration loop... looks okay. But independent of how often I call the C method in the outer loop, the execution time seems to be constant. Maybe there's something wrong with my QueryPerformanceCounter / -Frequency time measurement, but I don't think so, as the C code is done nearly instantly, whereas the SSE code really takes one or two seconds to finish.

Whatever that C-code voodoo effect is, once I have found the bug I'll post comparison results between the C code and the SSE versions. Let's go on with the SSE code, which seems to behave as expected, with more or less linear CPU consumption - fuck that voodoo until I've had some more sleep ^^

To give a better picture of the executed SSE code, I copied one exemplary loop from each SSE solution. Both assume that we operate on two addresses, 'A' and 'B', where the address of A satisfies (A MOD 16 == 1) and the address of B satisfies (B MOD 16 == 2), so A has to be assembled from 15+1 bytes and B from 14+2 bytes.

Code:
; let's have shifting fun:
absdif_01_02:
movaps xmm0, [esi]      ; first part of A
movaps xmm1, [edi]      ; first part of B
movaps xmm2, [esi + 16] ; second part of A
movaps xmm3, [edi + 16] ; second part of B
psrldq xmm0, 1          ; A: right shift out 1 byte to be taken from second part
psrldq xmm1, 2          ; B: right shift out 2 byte to be taken from second part
pslldq xmm2, 15         ; A: shift desired byte to be inserted leftmost
pslldq xmm3, 14         ; B: shift desired bytes to be inserted leftmost
paddb xmm0, xmm2        ; merge A
paddb xmm1, xmm3        ; merge B
pand xmm0, xmm7         ; mask out unwanted bytes (width < 16)
add esi, eax            ; increment A ptr
pand xmm1, xmm7         ; mask out unwanted bytes (width < 16)
add edi, edx            ; increment B ptr
psadbw xmm0, xmm1       ; get abs difference
paddq xmm6, xmm0        ; and add to global result
dec ecx                 ; loop
jne absdif_01_02
    


And here is the generated code using the palignr solution you suggested:

Code:
absdifpar_01_02:
movdqa xmm0, [esi+16]   ; prefill with second part of A
movdqa xmm1, [edi+16]   ; prefill with second part of B
palignr xmm0, [esi], 1  ; fill in first part of A -> done
palignr xmm1, [edi], 2  ; fill in first part of B -> done
pand xmm0, xmm7         ; A: mask out unwanted bytes (width < 16)
add esi, eax            ; increment A ptr
pand xmm1, xmm7         ; B: mask out unwanted bytes (width < 16)
add edi, edx            ; increment B ptr
psadbw xmm0, xmm1       ; get abs difference
paddq xmm6, xmm0        ; add to global var
dec ecx                 ; loop
jne absdifpar_01_02
    


As you already mentioned, palignr fits perfectly, since we must insert the leftmost bytes while shifting the prefilled value to the right.

Well, as I already mentioned, the palignr solution is about 30% faster... and as I just realized, I have two dependent instructions on the same register immediately following each other - the psadbw and the paddq... another point to check once I've had some more sleep... but since the code generator does the work of changing all 256 methods for me, let's have a nice dream...

PS: sorry about that mItes and mAtes confusion - of course I didn't want to call you insects =)

n8, mates *g*
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 17 Apr 2009, 19:48
Smile no need to be sorry, these confusions make people laugh. At least I got mine for that day.

Anyway, to shed some light on your compiler's misbehaviour, imagine this code:
Code:
long long sum=0;
for(long long x=1;x<=1000000;x++)
    sum+=x;
printf("%lld\n",sum);
    

This will print you 500000500000 as you would expect, but the code it generates will be something like this:
Code:
  mov  rcx,$fmt_str_packet_1
  mov  rdx,500000500000
  call printf
    

Hold on! Where's the loop?

The loop magically appears if you put a printf() in every iteration where you add to the sum, because then it has to do the work. Compilers are very good at optimizing everything away.

Once I tried some SSE functions and my Intel compiler "optimized" them out because I didn't use their results - but I just wanted to see the code it generates Razz

I went to great lengths to get the SSE code to appear in the assembly; most of the time it just pre-calculated the value (because I gave it constants in the code Smile ) and the code was as short as mov reg,value Smile
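
A small, hedged aside (not code from this thread): one common way to keep such a benchmark loop alive is to make its input opaque and to consume its output, so the compiler can neither pre-calculate the result nor drop the loop:

Code:
#include <stdio.h>

volatile long long limit = 1000000;   /* opaque input: cannot be constant-folded */
volatile long long sink;              /* consumed output: loop is not dead code  */

int main(void) {
    long long sum = 0;
    for (long long x = 1; x <= limit; x++)   /* the loop must really execute */
        sum += x;
    sink = sum;
    printf("%lld\n", sink);
    return 0;
}
    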
Justus73



Joined: 28 Mar 2009
Posts: 9
Justus73 18 Apr 2009, 08:43
Once more you were right, Madis... the compiler optimized away the main work of creating the array of absolute distances and summing up those elements... the movzx instructions I saw only created the array...

It was a stupid idea of mine to initialize a mere 18-byte array with arr[i] = i and to constantly pass &arr[1] and &arr[2] to the test methods, using a constant address increment of 0 bytes per loop. The compiler did not optimize away the loop, but it precalculated the array of absolute distances and stored it as an array of 32-bit values, which it then summed up. The movzx instructions I mentioned initialized that array.

When I changed the test to a "real-life" scenario, loading a hugely dimensioned image and letting all methods operate on its luminosity channel, the compiled code and the execution times reflected the real CPU cost of the C code...

So, here is the "realistic" result of the performance comparison between the C code and the SSE versions...

Code:
cdif   dif is 5061993, took 6.199838933762 secs
palign dif is 5061993, took 0.247584427027 secs
pslldq dif is 5061993, took 0.369428369240 secs
    
Justus73



Joined: 28 Mar 2009
Posts: 9
Justus73 18 Apr 2009, 12:57
I just added a second variant of the palignr code, calculating two rows at a time and using lea reg, [reg + 2*step] to save two adds and to unroll the loop a bit further... that improved speed by another 13%. Then I improved the runtime again by creating two variants of the functions: one for "we want to sum up fewer than 16 bytes per register" (requiring the pands for masking out undesired bytes) and one for "we want to sum up all 16 bytes in the register", without any pands for masking. Performance results:

Code:
cdif        dif is 5061436, took 6.202333040270 secs
palign [2u] dif is 5061436, took 0.214480939358 secs ; not masking, two rows per iteration
palign [2m] dif is 5061436, took 0.246009950970 secs ; masking, two rows per iteration
palign [1]  dif is 5061436, took 0.248955060666 secs ; masking, single row per iteration
pslldq      dif is 5061436, took 0.364519927047 secs ; masking, single row per iteration
    


The schematic code for addresses srcA and srcB with "address MOD 16 != 0" currently looks like this:

Code:
_absDifSSE3_01_01_twoLines:
movdqa xmm0, [esi+16]             ; prefill   A / line 0
movdqa xmm1, [edi+16]             ; prefill   B / line 0
movdqa xmm2, [esi+eax+16]         ; prefill   A / line 1
movdqa xmm3, [edi+edx+16]         ; prefill   B / line 1
palignr xmm0, [esi], offsetA      ; complete  A / line 0 [offsetA = numerical constant = srcA MOD 16] 
palignr xmm1, [edi], offsetB      ; complete  B / line 0 [offsetB = numerical constant = srcB MOD 16]
palignr xmm2, [esi+eax], offsetA  ; complete  A / line 1 [offsetA = numerical constant = srcA MOD 16]
palignr xmm3, [edi+edx], offsetB  ; complete  B / line 1 [offsetB = numerical constant = srcB MOD 16]
pand xmm0, xmm7                   ; mask out undesired bytes [optional]
pand xmm2, xmm7                   ; mask out undesired bytes [optional]
pand xmm1, xmm7                   ; mask out undesired bytes [optional]
pand xmm3, xmm7                   ; mask out undesired bytes [optional]
lea esi, [esi + eax*2]            ; increment srcA by stepA * 2
lea edi, [edi + edx*2]            ; increment srcB by stepB * 2
psadbw xmm0, xmm1                 ; get abs dif line 0
psadbw xmm2, xmm3                 ; get abs dif line 1
paddq xmm6, xmm0                  ; add abs dif
paddq xmm6, xmm2                  ; add abs dif
dec ecx
jne _absDifSSE3_01_01_twoLines
    


Does anyone have any further optimization ideas? I must confess: I have run out of ideas.


revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20433
Location: In your JS exploiting you and your system
revolution 18 Apr 2009, 13:06
Things you may want to try:
  • Move just one LEA in between the two PADDQs
  • Move both LEAs in between the two PADDQs.
  • Replace DEC ECX with SUB ECX,1
  • Reorder the PANDs to process registers in the order 0,1,2,3
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 18 Apr 2009, 21:16
Just to add to revolution's list:

    dec ecx / sub ecx,1 can be moved up

Some compilers move the counter all the way up, because nothing changes the flags when only SSE instructions are used. My other concern is that on newer CPUs micro-/macro-op fusion is possible, so cmp+Jcc might be calculated and issued in the same clock, meaning there would be no gain there.

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Justus73



Joined: 28 Mar 2009
Posts: 9
Justus73 19 Apr 2009, 11:24
Thanks... I tried the suggestions - "sub ecx, 1" instead of "dec ecx" and moving the sub a few lines upwards brought around another percent... it's hard to measure, as I seem to have reached the limit where "background activity noise" makes runtime differences hard to measure... I just came across Agner's page http://www.agner.org, which offers a tool snippet for measuring more precisely...

I used VTune about ten years ago - that was nice... it doesn't seem as if there is a similar freeware/open-source tool... or is there?
Sean4CC



Joined: 15 Apr 2015
Posts: 14
Sean4CC 17 Apr 2015, 13:27
Agner Fog is one of the rudest guys you are ever likely to encounter. I speak from personal experience. He is from the same mold as Linus Torvalds.





Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
