flat assembler
Message board for the users of flat assembler.

Index > Main > Optimization of SSE2 vector operation

Author
Thread Post new topic Reply to topic
mkm79



Joined: 27 Feb 2004
Posts: 8
mkm79 12 Jan 2007, 07:12
I have code of multiplication of 2 float vectors and calculation sum of all results

Code:
proc sse_code, v1, v2, size
  add    esp, -16

  ; this block will calculate the result:rest(eax:edx) of division <size> to 16 
  mov    edx, [size]
  mov    eax, edx
  shr    eax, 4
  mov    ecx, eax
  shl    ecx, 4
  sub    edx, ecx

  ;get addrs
  mov    ebx, [v1]
  mov    ecx, [v2]

  ;clear accumulator
  fldz

  ; work with blocks of 16 float values
  BEGIN1:
    dec    eax
    js     END1

    movups xmm0, [ebx]
    movups xmm4, [ecx]
    mulps  xmm0, xmm4

    movups xmm1, [ebx+16]
    movups xmm5, [ecx+16]
    mulps  xmm1, xmm5

    haddps xmm0, xmm1

    movups xmm2, [ebx+32]
    movups xmm6, [ecx+32]
    mulps  xmm2, xmm6

    movups xmm3, [ebx+48]
    movups xmm7, [ecx+48]
    mulps  xmm3, xmm7

    haddps xmm2, xmm3

    haddps xmm0, xmm2

    movups [esp-16], xmm0
    fadd   dword [esp-16]
    fadd   dword [esp-12]
    fadd   dword [esp-8]
    fadd   dword [esp-4]

    add    ebx, 64
    add    ecx, 64

    jmp    BEGIN1
  END1:

  ; REST >= 8
  cmp    edx, 8
  jl     SKIP1
  sub    edx, 8

  movups xmm0, [ebx]
  movups xmm2, [ecx]
  mulps  xmm0, xmm2

  movups xmm1, [ebx+16]
  movups xmm3, [ecx+16]
  mulps  xmm1, xmm3

  haddps xmm0, xmm1
  movups [esp-16], xmm0
  fadd   dword [esp-16]
  fadd   dword [esp-12]
  fadd   dword [esp-8]
  fadd   dword [esp-4]

  add    ebx, 32
  add    ecx, 32
  SKIP1:

  ; REST >= 4
  cmp    edx, 4
  jl     SKIP2
  sub    edx, 4

  movups xmm0, [ebx]
  movups xmm1, [ecx]
  mulps  xmm0, xmm1

  movups [esp-16], xmm0
  fadd   dword [esp-16]
  fadd   dword [esp-12]
  fadd   dword [esp-8]
  fadd   dword [esp-4]

  add    ebx, 16
  add    ecx, 16
  SKIP2:

  ; FPU rest muls
  BEGIN2:
    dec    edx
    js     END2
    fld    dword [ebx]
    fmul   dword [ecx]
    faddp  st1, st0
    add    ebx, 4
    add    ecx, 4
    jmp    BEGIN2
  END2:

  mov    esp, ebp
  pop    ebp
  ret    0x000c
endp
    


In the the simple function without SSE (only FPU) looks like:

Code:
proc fpu_code, v1, v2, size
  mov    ecx, [size]
  mov    eax, [v1]
  mov    ebx, [v2]
  fldz

  BEGIN1:
  dec    ecx
  js     END1
  fld    dword [eax+ecx*4]
  fmul   dword [ebx+ecx*4]
  faddp  st1, st0
  jmp    BEGIN1

END1:
  pop    ebp
  ret    0x000c
endp   
    


So the problem is that the <fpu_code> funtion is faster than <sse_code>
on big v1 and v2. (if size > 1e7, then )
I thought that SSE is faster because of big registers.

The questions is:
Is it possible to optimize the <sse_code> function for faster perfomance?

_________________
Best regards,
Kostya
Post 12 Jan 2007, 07:12
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 13 Jan 2007, 01:36
If you want to optimize you have to sacrifice some 'efficiency'.

Only allow sizes that are multiples of ~16 by padding the rest of the values with zero AND only allow aligned arrays so you can use aligned moves AND have the mulps's use the memory location for the second value.

Code:
proc sse_code, v1, v2, size 
mov edx,[size]
mov ecx,[v2]
mov eax,[v1]
sub edx,16
pxor xmm0,xmm0
.logic:
movaps xmm1,[eax+edx*4]
movaps xmm2,[eax+edx*4+16]
movaps xmm3,[eax+edx*4+32]
movaps xmm4,[eax+edx*4+48]
mulps xmm1,[ecx+edx*4]
mulps xmm2,[ecx+edx*4+16]
mulps xmm3,[ecx+edx*4+32]
mulps xmm4,[ecx+edx*4+48]
addps xmm1,xmm2
addps xmm3,xmm4
haddps xmm1,xmm1
haddps xmm3,xmm3
haddps xmm1,xmm1
haddps xmm3,xmm3
addss xmm0,xmm1
sub edx,16
addss xmm0,xmm3
jns .logic
ret 12
endp
    


Using SSE for a speed boost involves making assumptions about alignment and size. If you don't want to make those assumptions your speed won't improve as much.

The above code is kind of an extreme example, you can however scale it down to only loop 4 values at a time instead of 16 and then fill in the remaining values with scalar code. Or for even more optimization you can use a LUT to specialized sub functions when the size is less than ~16.
Post 13 Jan 2007, 01:36
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 13 Jan 2007, 03:16
I can see two problems with your code:
1) You are not thinking parallel enough: use more than 1 accumulator
2) Since the vector lengths you quote are way bigger than cache, your problems arise, most likely, because you are touching memory badly in some way. To fix this either try using lddqu instead of movups or make up separate cases for all of the possible alignments of v2 relative to v1. Assuming v1 and v2 are aligned 0 mod 4, then we have the following possibilities (assuming v1 aligned 0 mod 16)
Code:
; v2 aligned 0 mod 16
align 16
.0mod16:
   movaps  xmm2, [eax+ecx]
   mulps   xmm2, [eax+edx]
   addps   xmm0, xmm2
   movaps  xmm2, [eax+ecx+16]
   mulps   xmm2, [eax+edx+16]
   addps   xmm1, xmm2
   add     eax, 32
   js      .0mod16
    

Or for v2 aligned 4 mod 16:
Code:
; v2 aligned 4 mod 16
align 16
.4mod16:
   movaps  xmm2, [eax+edx]
   movaps  xmm3, xmm2
   palignr xmm2, xmm4, 4
   mulps   xmm2, [eax+ecx]
   addps   xmm0, xmm2
   movaps  xmm2, [eax+edx+16]
   movaps  xmm4, xmm2
   palignr xmm4, xmm3, 4
   mulps   xmm4, [eax+ecx+16]
   addps   xmm1, xmm2
   add     eax,  32
   js      .4mod16
    

And for v2 aligned 8 mod 16:
Code:
; v2 aligned 8 mod 16
align 16
.8mod16
   movaps  xmm2, [eax+edx]
   shufpd  xmm3, xmm2, 1
   mulps   xmm3, [eax+ecx]
   addps   xmm0, xmm3
   movaps  xmm3, [eax+edx+16]
   shufpd  xmm2, xmm3, 1
   mulps   xmm2, [eax+ecx+16]
   addps   xmm1, xmm2
   add     eax, 32
   js      .8mod16
    

A simple transformation handles the 12 mod 16 case, and the 8 accumulators may be summed via:
Code:
   addps   xmm1, xmm0
   xorpd   xmm0, xmm0
   movhlps xmm0, xmm1
   addps   xmm0, xmm1
   movsd   xmm1, xmm0
   psrld   xmm0, 32
   addss   xmm0, xmm1
    

Of course these are just untested inner loops, prologs and epilogs left as exercies to reader, but you get the idea: minimize the number of loads, make only aligned loads and adjust via data swizzling operations, and use multiple accumulators so that performance will be limited by load bandwidth rather than loop-carried latency, even with operands in cache.
Post 13 Jan 2007, 03:16
View user's profile Send private message Visit poster's website Reply with quote
mkm79



Joined: 27 Feb 2004
Posts: 8
mkm79 13 Jan 2007, 19:03
To r22:
I will try to make aligment by 16 [float] values in each vector. Thanks.

To Xorpd!:
Exellent instruction lddqu. About multiple accumulators I think I should use as many accumulators, as many cores CPU have (on dual-core 2 accumulators, quad-core-4 accums, etc). I am right, isn't it?
About minimization of memory accesses - I did not know, because in all cases I have only 16byte registers and not more, so in every case I should access memory N/16 times (as I know the CPU have only one memory controller and one memory interface)
Special thank for your point 1.

So - I will try all recommendations and will work out procedure by trying to mix different methods.

Thank for replies.
Post 13 Jan 2007, 19:03
View user's profile Send private message Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 16 Jan 2007, 09:05
I've been meaning to get back to this thread, but have been too tired lately. My formula for # accumulators is:
(# accumulators) = (# cores)*(# ways SIMD)*ceiling((loop-carried latency)/(2*(load reciprocal throughput))
For example, using Agner Fog's book as a reference, the load reciprocal throughputs seem to be 1 clock per 128-bit load, and the loop-carried latency is 3 or 4 clocks (for addps), and SSE2, as here, has 4-way SIMD parallelism, so we get
(# accumulators) = (# cores)*4*2 = 8*(# cores)
In multithreaded code, you can only see the code for one thread at a time, so the loops come out as I wrote them earlier with 8 accumulators: 4 in xmm0 and 4 in xmm1. Well, not exactly as I wrote them. Did you notice the bug in the 4 mod 16 loop? I tried a test case to see whether I got the shift count right:
Code:
format MS64 coff
section 'CODE' code readable executable align 16
align 16
public TEST1
TEST1:
   movaps  xmm0, [rcx]
   movaps  xmm1, [rcx+16]
   palignr xmm1, xmm0, 4
   movaps  [rcx+32], xmm1
   ret
    

I got results like:
01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10
11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20
05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14
where rcx pointed at the first byte of the above initially. This looks good because v2 should start at the byte containing the value 05 because it's aligned 4 mod 16 and rcx, hence memory accesses, should be aligned 0 mod 16. The start of the v2 vector does indeed get lined up correctly for multiplication with the start of the v1 vector. However, I made a couple of mistakes later in the inner loop, which should actually have read:
Code:
; v2 aligned 4 mod 16
align 16
.4mod16:
   movaps  xmm2, [eax+edx]
   movaps  xmm3, xmm2
   palignr xmm2, xmm4, 4
   mulps   xmm2, [eax+ecx]
   addps   xmm0, xmm2
   movaps  xmm2, [eax+edx+16]
   movaps  xmm4, xmm2
   palignr xmm2, xmm3, 4
   mulps   xmm2, [eax+ecx+16]
   addps   xmm1, xmm2
   add     eax,  32
   js     .4mod16
    

Ah well, it would have taken so much effort to code up a test case for this, and I was not even sure that you need an efficient sdot in any case. For instance, if I were going to code up an sgemm using SSE2+, the computational kernel I would build off of would be a fixed size and alignment sgemm rather than sdot because the latter is quite load-limited.

I note that I didn't communicate to you all of my meaning in my first message, but this isn't because of any language difference: there is a lot of prerequisite knowledge needed for writing optimal code on any processor, and it's not necessarily expressible in a reasonable way in any language. What I have found to be necessary is to write little test programs that can be timed via rdtsc and as you read documentation, test what you think the performance implications of the material are as you go along. In this way you can develop a feel for the characteristics of the target processor. BTW, what is the target processor in the context of this thread, and what OS, and what as the greater context in which the optimized code will function?
Post 16 Jan 2007, 09:05
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.