flat assembler
Message board for the users of flat assembler.

Only seeing a 5% speed up with SSE2, please review

mattst88



Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88 11 Nov 2009, 02:16
The pixman library, used on Linux for compositing operations, supports lots of SSE fast paths. One is being added now (see here), and it only provides a 5% speed up.

Do you guys see anything that could provide a better speed up in that code?

Thanks.

_________________
My x86 Instruction Reference -- includes SSE, SSE2, SSE3, SSSE3, SSE4 instructions.
Assembly Programmer's Journal
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 11 Nov 2009, 02:28
SSE is not a panacea for speed. You have to consider the underlying memory subsystem (including caches). It may be that the algorithm used is mostly memory bound, and thus, even if the CPU could compute things instantly, your speed up would still be minimal. I haven't looked at the code, so I can't say if it is memory bound or not, but that would be my first guess.
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
kohlrak 11 Nov 2009, 02:35
And this is why we always hear the silly arguments about asm only ever giving a minor percentage of speed gain...
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 11 Nov 2009, 02:48
asm used in the proper way can give an order of magnitude speed up. But the trick is to find "the proper way". Not always easy or possible, it depends upon what you are coding.

In the past simple CPU instruction level improvements would give almost linear code performance improvements. In these modern times, CPUs are a lot faster than memory and expecting linear speed-ups is folly unless you know for sure 100% of time is compute bound.
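
To put rough numbers on that point, here is a minimal sketch using Amdahl's law. The function names and the ideal 4x SIMD gain are assumptions for illustration only; nothing here was measured on pixman.

```c
/* Amdahl's law: overall speed-up when only a fraction of the run time
 * (compute_fraction, in 0..1) is accelerated by a factor of simd_gain.
 * Both inputs are hypothetical illustration values, not measurements. */
double overall_speedup(double compute_fraction, double simd_gain)
{
    return 1.0 / ((1.0 - compute_fraction) + compute_fraction / simd_gain);
}

/* Inverse: the compute-bound fraction implied by an observed overall
 * speed-up, assuming the accelerated part really got simd_gain faster. */
double implied_compute_fraction(double observed_speedup, double simd_gain)
{
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / simd_gain);
}
```

Under those assumptions, implied_compute_fraction(1.05, 4.0) comes out at roughly 0.06: a 5% overall gain is what you would expect if only about 6% of the run time were compute bound and the rest were spent waiting on memory.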
bitshifter



Joined: 04 Dec 2007
Posts: 796
Location: Massachusetts, USA
bitshifter 11 Nov 2009, 02:50
I'm still an SSE noob, but what I've learned is...
1) Cache values only once if you can.
2) Align the data so you can use aligned instructions.
3) Perform batch processing whenever possible.
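
A minimal sketch of those three points with SSE2 intrinsics. add_saturate_aligned is a hypothetical helper, not code from pixman, and it assumes both buffers are 16-byte aligned and n is a multiple of 4.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Adds two arrays of 32-bit pixels with per-byte unsigned saturation.
 * Hypothetical helper for illustration.
 * Assumes dst and src are 16-byte aligned and n is a multiple of 4. */
void add_saturate_aligned(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        /* Each value is loaded exactly once, with an aligned load. */
        __m128i d = _mm_load_si128((const __m128i *)(dst + i));
        __m128i s = _mm_load_si128((const __m128i *)(src + i));
        /* Four pixels (16 bytes) are processed per iteration. */
        _mm_store_si128((__m128i *)(dst + i), _mm_adds_epu8(d, s));
    }
}
```

If the alignment cannot be guaranteed, _mm_loadu_si128 and _mm_storeu_si128 are the unaligned counterparts; the aligned forms fault on misaligned addresses.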

Nice stuff though.
I like gfx so I intend to study it a while.
Thanks for the link. :)
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
kohlrak 11 Nov 2009, 02:55
Quote:
asm used in the proper way can give an order of magnitude speed up. But the trick is to find "the proper way". Not always easy or possible, it depends upon what you are coding.

In the past simple CPU instruction level improvements would give almost linear code performance improvements. In these modern times, CPUs are a lot faster than memory and expecting linear speed-ups is folly unless you know for sure 100% of time is compute bound.


Fortunately for people like me, compilers make silly mistakes that really add up in the end (even though, if you try telling anyone these things, they'll say this and that doesn't matter because it's small). The Great Wall of China was made with small, small bricks.

Quote:
I'm still an SSE noob, but what I've learned is...
1) Cache values only once if you can.
2) Align the data so you can use aligned instructions.
3) Perform batch processing whenever possible.


This can be applied to just about everything on the x86.
r22



Joined: 27 Dec 2004
Posts: 805
r22 11 Nov 2009, 15:03
@mattst88

In an attempt to actually answer your question...
After looking at the DIFF (code) you linked to, a few things became obvious that could account for the lackluster speed up.

- The "cache_prefetch" intrinsic seems to be overused; it should just be on the outer loop IMHO

- If this is for 32-bit, more XMM (__m128i) locals are declared than there are actual XMM registers (only eight in 32-bit mode). This is probably causing trouble for the compiler: values get spilled to local memory and shuffled back into registers.

- Loop unrolling is standard for SSE optimized code, does the compiler attempt this? Without having the ASM output it's really hard to tell.

- 64-bit data types are mixed in with the 128-bit ones

- The indexing is sub-optimal: the code increments the array pointers and decrements the counter separately
Code:
src += 4;
dst += 4;
mask += 4;
w -= 4;
    

Unless the compiler is smart, it's likely using this slower convention instead of having one index variable.
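
A rough sketch of the two styles side by side. The function names are made up, and the bitwise AND is only a stand-in for the real compositing math. Whether the second form is actually faster depends on the compiler, so check the ASM output.

```c
#include <stddef.h>
#include <stdint.h>

/* Style used in the diff: three pointers bumped, plus a separate counter.
 * The AND is a placeholder for the real per-pixel work. */
void blend_ptr(uint32_t *dst, const uint32_t *src, const uint32_t *mask, int w)
{
    while (w >= 4) {
        for (int k = 0; k < 4; k++)
            dst[k] = src[k] & mask[k];
        src += 4;
        dst += 4;
        mask += 4;
        w -= 4;
    }
}

/* Same work with a single index variable shared by all three arrays. */
void blend_idx(uint32_t *dst, const uint32_t *src, const uint32_t *mask, int w)
{
    for (ptrdiff_t i = 0; i + 4 <= w; i += 4)
        for (int k = 0; k < 4; k++)
            dst[i + k] = src[i + k] & mask[i + k];
}
```

Both versions process whole groups of four pixels and leave any tail (w not a multiple of 4) untouched, so they produce identical output.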

- Conditionals: this algorithm has a lot of conditional cases (if...else) (while ?). Branchless algorithms are usually better suited to SIMD optimization.
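
As a sketch of what "branchless" can look like with SSE2: the per-pixel "if (mask)" test is replaced by a compare that builds a lane mask, followed by AND/ANDNOT/OR. The function names are hypothetical and the operation is deliberately simplified (a plain select, not the real alpha blend in the diff).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Branchy reference: for 4 pixels, take src where mask != 0, else keep dst. */
void select_branchy(uint32_t *dst, const uint32_t *src, const uint32_t *mask)
{
    for (int i = 0; i < 4; i++)
        if (mask[i])
            dst[i] = src[i];
}

/* Branchless SSE2 version of the same 4-pixel select. */
void select_branchless(uint32_t *dst, const uint32_t *src, const uint32_t *mask)
{
    __m128i d = _mm_loadu_si128((const __m128i *)dst);
    __m128i s = _mm_loadu_si128((const __m128i *)src);
    __m128i m = _mm_loadu_si128((const __m128i *)mask);
    /* keep = all ones in every 32-bit lane where mask == 0 */
    __m128i keep = _mm_cmpeq_epi32(m, _mm_setzero_si128());
    /* result = (keep & d) | (~keep & s); andnot(a, b) computes ~a & b */
    __m128i r = _mm_or_si128(_mm_and_si128(keep, d),
                             _mm_andnot_si128(keep, s));
    _mm_storeu_si128((__m128i *)dst, r);
}
```

The trade-off is that the branchless version always does the full work for all four lanes, so it only pays off when the branch is hard to predict or the skipped work is cheap.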


With the ASM output and a better understanding of HLL use of SIMD I would be able to get more detailed (I have a hard time following the naming conventions with _'s and 1x64's and such everywhere).
Azu



Joined: 16 Dec 2008
Posts: 1159
Azu 11 Nov 2009, 21:37
r22 wrote:
- Conditionals: this algorithm has a lot of conditional cases (if...else) (while ?). Branchless algorithms are usually better suited to SIMD optimization.
Were it branchless, wouldn't GPGPU be a better solution? I thought GPUs obliterated CPUs at branchless stuff.

LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 11 Nov 2009, 22:03
mattst88, if you're able to compile it please use the following:
Code:
$ gcc whatever -O3 -S -masm=intel
Then copy the part that belongs to sse2_composite_over_8888_8_8888 here. If it is not very large then also post using "-O2" just in case the loop unrolling made by gcc makes things worse.

Because of the observation made by r22, try to get a 64-bit output (you can force it with "-m64" compiler switch).
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 12 Nov 2009, 15:40
It's true, -O3 doesn't bring as much benefit in the latest GCCs and often even hurts vs. -O2, unlike in earlier GCCs.

BTW, am I the only one who thinks 5% is better than nothing?! It may not sound like much, but it can add up! Saying things like "only 5%? bah" doesn't help, esp. when improvements are often incremental and not huge. Only over time can you truly improve software. (Rome wasn't built in a day.)


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
