flat assembler
Message board for the users of flat assembler.
Index
> Main > Optimization of SSE2 vector operation 
Author 

r22
If you want to optimize you have to sacrifice some 'efficiency'.
Only allow sizes that are multiples of ~16 by padding the rest of the values with zero AND only allow aligned arrays so you can use aligned moves AND have the mulps's use the memory location for the second value. Code: proc sse_code, v1, v2, size mov edx,[size] mov ecx,[v2] mov eax,[v1] sub edx,16 pxor xmm0,xmm0 .logic: movaps xmm1,[eax+edx*4] movaps xmm2,[eax+edx*4+16] movaps xmm3,[eax+edx*4+32] movaps xmm4,[eax+edx*4+48] mulps xmm1,[ecx+edx*4] mulps xmm2,[ecx+edx*4+16] mulps xmm3,[ecx+edx*4+32] mulps xmm4,[ecx+edx*4+48] addps xmm1,xmm2 addps xmm3,xmm4 haddps xmm1,xmm1 haddps xmm3,xmm3 haddps xmm1,xmm1 haddps xmm3,xmm3 addss xmm0,xmm1 sub edx,16 addss xmm0,xmm3 jns .logic ret 12 endp Using SSE for a speed boost involves making assumptions about alignment and size. If you don't want to make those assumptions your speed won't improve as much. The above code is kind of an extreme example, you can however scale it down to only loop 4 values at a time instead of 16 and then fill in the remaining values with scalar code. Or for even more optimization you can use a LUT to specialized sub functions when the size is less than ~16. 

13 Jan 2007, 01:36 

Xorpd!
I can see two problems with your code:
1) You are not thinking parallel enough: use more than 1 accumulator 2) Since the vector lengths you quote are way bigger than cache, your problems arise, most likely, because you are touching memory badly in some way. To fix this either try using lddqu instead of movups or make up separate cases for all of the possible alignments of v2 relative to v1. Assuming v1 and v2 are aligned 0 mod 4, then we have the following possibilities (assuming v1 aligned 0 mod 16) Code: ; v2 aligned 0 mod 16 align 16 .0mod16: movaps xmm2, [eax+ecx] mulps xmm2, [eax+edx] addps xmm0, xmm2 movaps xmm2, [eax+ecx+16] mulps xmm2, [eax+edx+16] addps xmm1, xmm2 add eax, 32 js .0mod16 Or for v2 aligned 4 mod 16: Code: ; v2 aligned 4 mod 16 align 16 .4mod16: movaps xmm2, [eax+edx] movaps xmm3, xmm2 palignr xmm2, xmm4, 4 mulps xmm2, [eax+ecx] addps xmm0, xmm2 movaps xmm2, [eax+edx+16] movaps xmm4, xmm2 palignr xmm4, xmm3, 4 mulps xmm4, [eax+ecx+16] addps xmm1, xmm2 add eax, 32 js .4mod16 And for v2 aligned 8 mod 16: Code: ; v2 aligned 8 mod 16 align 16 .8mod16 movaps xmm2, [eax+edx] shufpd xmm3, xmm2, 1 mulps xmm3, [eax+ecx] addps xmm0, xmm3 movaps xmm3, [eax+edx+16] shufpd xmm2, xmm3, 1 mulps xmm2, [eax+ecx+16] addps xmm1, xmm2 add eax, 32 js .8mod16 A simple transformation handles the 12 mod 16 case, and the 8 accumulators may be summed via: Code: addps xmm1, xmm0 xorpd xmm0, xmm0 movhlps xmm0, xmm1 addps xmm0, xmm1 movsd xmm1, xmm0 psrld xmm0, 32 addss xmm0, xmm1 Of course these are just untested inner loops, prologs and epilogs left as exercies to reader, but you get the idea: minimize the number of loads, make only aligned loads and adjust via data swizzling operations, and use multiple accumulators so that performance will be limited by load bandwidth rather than loopcarried latency, even with operands in cache. 

13 Jan 2007, 03:16 

mkm79
To r22:
I will try to make aligment by 16 [float] values in each vector. Thanks. To Xorpd!: Exellent instruction lddqu. About multiple accumulators I think I should use as many accumulators, as many cores CPU have (on dualcore 2 accumulators, quadcore4 accums, etc). I am right, isn't it? About minimization of memory accesses  I did not know, because in all cases I have only 16byte registers and not more, so in every case I should access memory N/16 times (as I know the CPU have only one memory controller and one memory interface) Special thank for your point 1. So  I will try all recommendations and will work out procedure by trying to mix different methods. Thank for replies. 

13 Jan 2007, 19:03 

Xorpd!
I've been meaning to get back to this thread, but have been too tired lately. My formula for # accumulators is:
(# accumulators) = (# cores)*(# ways SIMD)*ceiling((loopcarried latency)/(2*(load reciprocal throughput)) For example, using Agner Fog's book as a reference, the load reciprocal throughputs seem to be 1 clock per 128bit load, and the loopcarried latency is 3 or 4 clocks (for addps), and SSE2, as here, has 4way SIMD parallelism, so we get (# accumulators) = (# cores)*4*2 = 8*(# cores) In multithreaded code, you can only see the code for one thread at a time, so the loops come out as I wrote them earlier with 8 accumulators: 4 in xmm0 and 4 in xmm1. Well, not exactly as I wrote them. Did you notice the bug in the 4 mod 16 loop? I tried a test case to see whether I got the shift count right: Code: format MS64 coff section 'CODE' code readable executable align 16 align 16 public TEST1 TEST1: movaps xmm0, [rcx] movaps xmm1, [rcx+16] palignr xmm1, xmm0, 4 movaps [rcx+32], xmm1 ret I got results like: 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 where rcx pointed at the first byte of the above initially. This looks good because v2 should start at the byte containing the value 05 because it's aligned 4 mod 16 and rcx, hence memory accesses, should be aligned 0 mod 16. The start of the v2 vector does indeed get lined up correctly for multiplication with the start of the v1 vector. However, I made a couple of mistakes later in the inner loop, which should actually have read: Code: ; v2 aligned 4 mod 16 align 16 .4mod16: movaps xmm2, [eax+edx] movaps xmm3, xmm2 palignr xmm2, xmm4, 4 mulps xmm2, [eax+ecx] addps xmm0, xmm2 movaps xmm2, [eax+edx+16] movaps xmm4, xmm2 palignr xmm2, xmm3, 4 mulps xmm2, [eax+ecx+16] addps xmm1, xmm2 add eax, 32 js .4mod16 Ah well, it would have taken so much effort to code up a test case for this, and I was not even sure that you need an efficient sdot in any case. For instance, if I were going to code up an sgemm using SSE2+, the computational kernel I would build off of would be a fixed size and alignment sgemm rather than sdot because the latter is quite loadlimited. I note that I didn't communicate to you all of my meaning in my first message, but this isn't because of any language difference: there is a lot of prerequisite knowledge needed for writing optimal code on any processor, and it's not necessarily expressible in a reasonable way in any language. What I have found to be necessary is to write little test programs that can be timed via rdtsc and as you read documentation, test what you think the performance implications of the material are as you go along. In this way you can develop a feel for the characteristics of the target processor. BTW, what is the target processor in the context of this thread, and what OS, and what as the greater context in which the optimized code will function? 

16 Jan 2007, 09:05 

< Last Thread  Next Thread > 
Forum Rules:

Copyright © 19992020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.