flat assembler
Message board for the users of flat assembler.
Only seeing a 5% speed up with SSE2, please review
mattst88 11 Nov 2009, 02:16
The pixman library, used on Linux for compositing operations, supports lots of SSE fast paths. One is being added now (see here), and it only provides a 5% speed up.
Do you guys see anything that could provide a better speed up in that code? Thanks.
kohlrak 11 Nov 2009, 02:35
And this is why we always hear the silly arguments about asm only ever giving a minor percentage of speed gain...
revolution 11 Nov 2009, 02:48
Asm used in the proper way can give an order-of-magnitude speed up. The trick is finding "the proper way", which is not always easy or even possible; it depends on what you are coding.
In the past, simple CPU instruction-level improvements would give almost linear gains in code performance. These days CPUs are a lot faster than memory, so expecting linear speed-ups is folly unless you know for sure that 100% of the time is compute bound.
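The compute-bound point can be put into numbers with Amdahl's law. The sketch below is not from the thread; the function name and the example figures are assumptions chosen only to illustrate the arithmetic, not measurements of pixman.

```c
#include <assert.h>

/* Overall speed-up when only part of the run time benefits from a faster
   code path (Amdahl's law). Names and inputs are illustrative assumptions.
   compute_fraction: share of total time spent in the optimized code (0..1)
   compute_speedup:  how much faster that code became (e.g. 4.0 for 4x)   */
double overall_speedup(double compute_fraction, double compute_speedup)
{
    assert(compute_fraction >= 0.0 && compute_fraction <= 1.0);
    assert(compute_speedup > 0.0);
    return 1.0 / ((1.0 - compute_fraction)
                  + compute_fraction / compute_speedup);
}
```

With these made-up inputs, overall_speedup(0.1, 4.0) is about 1.08: a routine that is only 10% compute bound gains roughly 8% from a 4x faster kernel, while overall_speedup(1.0, 4.0) gives the full 4x.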
bitshifter 11 Nov 2009, 02:50
I'm still an SSE noob, but what I've learned is...
1) Cache values only once if you can.
2) Align the data so you can use aligned instructions.
3) Perform batch processing whenever possible.
Nice stuff though. I like gfx so I intend to study it a while. Thanks for the link.
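The three tips above can be sketched in C with SSE2 intrinsics. This is not pixman code: the saturating byte add is just a stand-in operation, and the function names are made up for the example. It assumes both buffers are 16-byte aligned and the length is a multiple of 16; a scalar fallback is included for builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* dst[i] = min(255, dst[i] + src[i]) for n bytes.
   Assumes 16-byte-aligned buffers and n % 16 == 0 (checked by assert). */
void add_saturate_u8(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n % 16 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i += 16) {
        /* Each value is loaded once, with aligned loads, 16 bytes per batch. */
        __m128i s = _mm_load_si128((const __m128i *)(src + i));
        __m128i d = _mm_load_si128((const __m128i *)(dst + i));
        _mm_store_si128((__m128i *)(dst + i), _mm_adds_epu8(d, s));
    }
#else
    /* Scalar fallback with the same result. */
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)dst[i] + src[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
#endif
}

/* Small self-check on two aligned buffers; returns 1 on success. */
int add_saturate_demo(void)
{
    static _Alignas(16) uint8_t dst[32];
    static _Alignas(16) uint8_t src[32];
    for (size_t i = 0; i < 32; i++) {
        dst[i] = 200;
        src[i] = (uint8_t)(i * 4);          /* 0, 4, ..., 124 */
    }
    add_saturate_u8(dst, src, 32);
    for (size_t i = 0; i < 32; i++) {
        unsigned expect = 200u + (unsigned)i * 4u;
        if (expect > 255u)
            expect = 255u;                  /* saturated lanes */
        if (dst[i] != expect)
            return 0;
    }
    return 1;
}
```

If the alignment cannot be guaranteed, _mm_loadu_si128 and _mm_storeu_si128 are the unaligned counterparts.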
kohlrak 11 Nov 2009, 02:55
Quote: asm used in the proper way can give an order of magnitude speed up. But the trick is to find "the proper way". Not always easy or possible, it depends upon what you are coding.
Fortunately for people like me, compilers make silly mistakes that really add up in the end (even though, when you try telling anyone these things, they'll say this and that doesn't matter because it's small). The Great Wall of China was made with small, small bricks.
Quote: Im still a SSE noob but what i learned is...
This can be applied to just about everything on the x86.
r22 11 Nov 2009, 15:03
@mattst88
In an attempt to actually answer your question... After looking at the diff you linked to, a few things stood out that could account for the lackluster speed up.
- The "cache_prefetch" intrinsic seems to be overused; it should just be in the outer loop, IMHO.
- If this is for 32-bit, more XMMx (__m128i) locals are declared than there are actual XMMx registers (this is probably causing trouble for the compiler). Local memory is being shuffled about with the registers.
- Loop unrolling is standard for SSE-optimized code; does the compiler attempt this? Without the ASM output it's really hard to tell.
- 64-bit data types are mixed in with the 128-bit ones.
- The indexing is sub-optimal: the code increments the array pointers and decrements the counter separately.
Code:
src += 4;
dst += 4;
mask += 4;
w -= 4;
Unless the compiler is smart, it's likely using this slower convention instead of having one indexing variable.
- Conditionals: this algorithm has a lot of conditional cases (if...else) (while ?). Branchless algorithms are usually better suited to SIMD optimization.
With the ASM output and a better understanding of HLL use of SIMD I would be able to get more detailed (I have a hard time following the naming conventions with _'s and 1x64's and such everywhere).
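The indexing point can be illustrated with two equivalent loops. This is a sketch, not the code from the diff: the function names are hypothetical and the per-pixel operation (a plain AND with the mask) is only a placeholder. As r22 notes, a smart compiler may well generate the same code for both; the ASM output is the only way to be sure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Style from the quoted snippet: three pointers and a counter all advance. */
void combine_ptr_bump(uint32_t *dst, const uint32_t *src,
                      const uint32_t *mask, int w)
{
    assert(w >= 0 && w % 4 == 0);
    while (w > 0) {
        for (int k = 0; k < 4; k++)
            dst[k] = src[k] & mask[k];      /* placeholder pixel operation */
        src += 4;
        dst += 4;
        mask += 4;
        w -= 4;
    }
}

/* Single index: only i changes; it is the offset into all three arrays
   and the loop counter at the same time. */
void combine_single_index(uint32_t *dst, const uint32_t *src,
                          const uint32_t *mask, int w)
{
    assert(w >= 0 && w % 4 == 0);
    for (ptrdiff_t i = 0; i < w; i += 4) {
        for (int k = 0; k < 4; k++)
            dst[i + k] = src[i + k] & mask[i + k];
    }
}

/* Returns 1 if both styles produce the same, expected result. */
int combine_styles_agree(void)
{
    uint32_t src[8], mask[8], a[8], b[8];
    for (int i = 0; i < 8; i++) {
        src[i]  = 0x01010101u * (uint32_t)(i + 1);
        mask[i] = (i % 2) ? 0xFFFFFFFFu : 0x00FF00FFu;
        a[i] = b[i] = 0;
    }
    combine_ptr_bump(a, src, mask, 8);
    combine_single_index(b, src, mask, 8);
    for (int i = 0; i < 8; i++)
        if (a[i] != b[i] || a[i] != (src[i] & mask[i]))
            return 0;
    return 1;
}
```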
Azu 11 Nov 2009, 21:37
r22 wrote:
- Conditionals: this algorithm has a lot of conditional cases (if...else) (while ?). Branchless algorithms are usually better suited to SIMD optimization.
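As a sketch of what "branchless" can look like with SSE2, the per-pixel test below is turned into a compare that builds a lane mask, followed by AND/ANDNOT/OR to pick the result. The rule used here (keep dst where the mask pixel is zero, otherwise take src) is an assumption made for the example and is not taken from the pixman diff; the function names are hypothetical too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* For each 32-bit pixel: dst = (mask == 0) ? dst : src,
   with no per-pixel branch. Assumes n % 4 == 0. */
void select_where_mask_nonzero(uint32_t *dst, const uint32_t *src,
                               const uint32_t *mask, size_t n)
{
    assert(n % 4 == 0);
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i m = _mm_loadu_si128((const __m128i *)(mask + i));
        /* All ones in each lane where mask == 0, all zeros elsewhere. */
        __m128i keep = _mm_cmpeq_epi32(m, zero);
        /* (keep & d) | (~keep & s) */
        __m128i r = _mm_or_si128(_mm_and_si128(keep, d),
                                 _mm_andnot_si128(keep, s));
        _mm_storeu_si128((__m128i *)(dst + i), r);
    }
#else
    /* Scalar fallback using the same mask-and-merge idea. */
    for (size_t i = 0; i < n; i++) {
        uint32_t keep = (uint32_t)0 - (uint32_t)(mask[i] == 0);
        dst[i] = (keep & dst[i]) | (~keep & src[i]);
    }
#endif
}

/* Returns 1 if the selection behaves as described. */
int select_demo(void)
{
    uint32_t dst[4] = { 1, 2, 3, 4 };
    const uint32_t src[4]  = { 10, 20, 30, 40 };
    const uint32_t mask[4] = { 0, 0xFF000000u, 0, 7 };
    select_where_mask_nonzero(dst, src, mask, 4);
    return dst[0] == 1 && dst[1] == 20 && dst[2] == 3 && dst[3] == 40;
}
```

The trade-off is that both outcomes are computed for every pixel, so this pays off mainly when the branch is hard to predict or the skipped work is cheap.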
LocoDelAssembly 11 Nov 2009, 22:03
mattst88, if you're able to compile it, please use the following:
Code:
$ gcc whatever -O3 -S -masm=intel
Because of the observation made by r22, try to get a 64-bit output (you can force it with the "-m64" compiler switch).
rugxulo 12 Nov 2009, 15:40
It's true, -O3 doesn't bring as much benefit in the latest GCCs and often even hurts vs. -O2, unlike earlier GCCs.
BTW, am I the only one who thinks 5% is better than nothing?! It may not sound like much, but it can add up! Saying things like "only 5%? bah" doesn't help, esp. when improvements are often incremental and not huge. Only over time can you truly improve software. (Rome wasn't built in a day.)
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.