flat assembler
Message board for the users of flat assembler.

Index > Main > branch prediction

system error



Joined: 01 Sep 2013
Posts: 670
system error 12 Aug 2017, 00:14
@Furs

Vivik is asking for a simple demo code. Well, since you've already committed 10,000 strong expert words in this thread, he's getting the impression that you're an expert; exactly the effect that you wanted people to believe of you xD

@vivik
Insults are culture-dependent. In some cultures, F*** and N*** words are just a family norm. This board can't take all the cultural values of different people to be part of the "forum guidelines". It would be unfair to people like Furs.

And what exactly are you trying to do with those high-level things? In an HLL, things like optimization are done automatically by the compiler. Unlike in ASM, HLL programmers don't have total control over which optimizations are applied during compilation, except of course for a few switches at the command line. I guess that's why they're called High-Level languages / compilers?

"Branch prediction", as the title suggests, is a pure ASM topic, so you should look at it from an ASM point of view. Interpreting "branch prediction" from an HLL view is not the correct approach because of the thick abstraction layers the compiler puts between the source and the machine binaries. Many weird things can happen in between.

You're probably overthinking it and, probably, overdoing it.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 19874
Location: In your JS exploiting you and your system
revolution 12 Aug 2017, 05:51
Furs wrote:
I don't agree with revolution's testing for one reason: I believe in good coding practice more than tests, unless you really want to target one particular CPU. For example, AMD and Intel have vastly different CPUs. Testing on Intel doesn't mean it runs well on AMD, and vice-versa. Even Intel have different CPUs if they have vastly different microarchitectures.
I think you missed the point. Tests are primarily to discover if one is wasting time worrying about things that make no discernible difference. And secondly, they are a way to optimise those parts that the first tests show are making a difference. I have surprised myself by thinking that some particular part of my code would be the hot section, only to discover with testing that some other portion was the real bottleneck. I could've wasted a lot of time trying to optimise the wrong part. One might argue that optimising every part is best, but there are only so many hours in each day so we have to prioritise.
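
A minimal C sketch of the "test before you optimise" idea above. The two summing functions and the helper time_calls are hypothetical names made up for illustration; the only library call is the standard clock(). This is a rough harness under those assumptions, not a rigorous benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Two hypothetical candidates that compute the same result. */
unsigned long sum_simple(const unsigned *data, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

unsigned long sum_unrolled(const unsigned *data, size_t n)
{
    unsigned long a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {   /* two elements per iteration */
        a += data[i];
        b += data[i + 1];
    }
    if (i < n)                    /* odd element left over */
        a += data[i];
    return a + b;
}

typedef unsigned long (*sum_fn)(const unsigned *, size_t);

/* Returns the CPU time, in seconds, spent on `repeats` calls of fn.
 * The results are accumulated into a volatile variable so they stay
 * observable; an optimising compiler may still hoist or merge calls,
 * so treat the numbers as a rough guide only. */
double time_calls(sum_fn fn, const unsigned *data, size_t n, int repeats)
{
    assert(fn != NULL && repeats > 0);
    volatile unsigned long sink = 0;
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        sink += fn(data, n);
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Timing both candidates on realistic input, on the machine that matters, shows whether the "optimised" version is worth keeping at all.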
Furs



Joined: 04 Mar 2016
Posts: 2346
Furs 12 Aug 2017, 16:11
system error wrote:
Vivik is asking for a simple demo code. Well, since you've already committed 10,000 strong expert words in this thread, he's getting the impression that you're an expert; exactly the effect that you wanted people to believe of you xD
He's asking for demo code for __builtin_expect, but that is not possible. It's a compiler thing -- i.e. it's an internal hint to the compiler. There is no "asm pattern" to show for it, since it doesn't insert any specific asm instructions in the output. It transforms the code as GCC sees fit for that purpose.

HLL code is not directly representable in asm -- it may be in some cases (because the code is simple and the optimal result is easy to see) but not always. A naive view is that a goto, for instance, will always result in a jump (or conditional jump), but that is obviously not the case. Like I said, a goto can even have its destination inlined (yes, GCC can duplicate basic blocks if it thinks it's worth it) with no branch at all. Or GCC can inline the destination at the goto and have the "fall through" case branch to that part instead -- if it determines that the basic block from which the goto is issued is "hotter" than the fall through path. You simply can't translate HLL code from a compiler to asm directly.
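
What can be shown is the source-level usage. Below is a minimal sketch, assuming GCC or a compatible compiler. The macro names LIKELY/UNLIKELY are a common convention rather than part of GCC, and sum_nonnegative is a hypothetical function. What the hint does to the generated asm, if anything, depends on the compiler version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Conventional wrappers around the GCC builtin; fall back to a plain
 * expression on compilers that do not define __GNUC__. */
#if defined(__GNUC__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Hypothetical example: sums the non-negative entries and counts the
 * negative ones, which the caller claims are rare. */
long sum_nonnegative(const int *values, size_t count, size_t *rejected)
{
    assert(values != NULL && rejected != NULL);
    long total = 0;
    *rejected = 0;
    for (size_t i = 0; i < count; i++) {
        if (UNLIKELY(values[i] < 0)) {   /* hint: this branch is cold */
            (*rejected)++;
            continue;
        }
        total += values[i];
    }
    return total;
}
```

The hint never changes the result, only (possibly) the layout of the code. Compiling with gcc -O2 -S once with and once without the macros and diffing the two outputs is one way to see what a given GCC version made of it.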

I know you always babble about me trying to sound smart, but I am a contributor to GCC (even though it's a huge project so I don't know how most of it works, nobody probably does; I only have experience in memory aliasing optimization in GIMPLE, walking virtual SSA defs/uses, and some basic RTL optimizations based on patterns) -- whether you believe that or not, I couldn't care less.

revolution wrote:
I think you missed the point. Tests are primarily to discover if one is wasting time worrying about things that make no discernible difference. And secondly, they are a way to optimise those parts that the first tests show are making a difference. I have surprised myself by thinking that some particular part of my code would be the hot section, only to discover with testing that some other portion was the real bottleneck. I could've wasted a lot of time trying to optimise the wrong part. One might argue that optimising every part is best, but there are only so many hours in each day so we have to prioritise.
But what if the person in question is developing a library? Or something with no immediate use but which might be used in hot code in the future?

You can't know what the hot path is (unless the function is large) since you don't know how people will call it. Just because an allocator's speed makes no difference in your specific use case doesn't mean it shouldn't be designed for absolute speed. For someone who actually does use it in a critical loop it makes all the difference. ;)
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 19874
Location: In your JS exploiting you and your system
revolution 12 Aug 2017, 19:05
Furs wrote:
But what if the person in question is developing a library? Or something with no immediate use but which might be used in hot code in the future?

You can't know what the hot path is (unless the function is large) since you don't know how people will call it. Just because an allocator's speed makes no difference in your specific use case doesn't mean it shouldn't be designed for absolute speed. For someone who actually does use it in a critical loop it makes all the difference. ;)
That is a fair concern. But I think that generic libraries can't be expected to be optimised for anything (except for size perhaps). Speed optimisation is very specific to the platform/system and not generic. Following "best practices" for speed optimisation usually gets you a small amount of boost or a small amount of anti-boost, depending upon the system. Because of the specificity of the CPUs and their internal designs there aren't really any generic best practices that have a significant advantage across all systems. And besides, optimising for speed is usually not as important as many people like to make out. Many people waste more time than they ever save. So unless you are doing it just for fun and your boss doesn't mind, you might be better off just leaving it and writing new code for some cool new feature.

Perhaps a few examples:

Align procedure calls and loop entries to the cache line size: Okay, good, except different CPUs have different cache line sizes. Plus, this makes the code larger and can push other code out of the cache, causing cache thrashing.

Avoid the loop instruction: On some CPUs it makes no difference, so avoiding it is pointless. Even on CPUs where loop is slow, the alternative instructions take more space, which can cause cache problems or make the code cross a cache line boundary.

Use branch hints: Useless on many CPUs and only serves to bloat code. And even where they have an effect it is tiny so only the most heavily (ab)used code loops (i.e. used continuously for days and days) will see any benefit.

Never use div unless there is no suitable alternative: And naturally the meaning of "suitable" is ambiguous. But anyway, some modern CPUs can do DIV in a separate unit outside of the ALU so that other ALU intensive instructions can execute simultaneously, so you might be missing out on some great optimisation opportunities.
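
For the div case, a minimal sketch of one commonly cited "suitable alternative", assuming unsigned operands and a power-of-two divisor (the function names are hypothetical). Optimising compilers generally do this on their own when the divisor is a compile-time constant, so it mostly matters in hand-written asm, and whether it wins anything still has to be measured.

```c
#include <assert.h>
#include <stdint.h>

/* x / 2^shift for unsigned x: a right shift gives the same quotient. */
uint32_t div_pow2(uint32_t x, unsigned shift)
{
    assert(shift < 32);
    return x >> shift;
}

/* x % 2^shift for unsigned x: masking off the low bits gives the
 * same remainder. */
uint32_t mod_pow2(uint32_t x, unsigned shift)
{
    assert(shift < 32);
    return x & ((UINT32_C(1) << shift) - 1u);
}
```

Neither identity holds as written for negative signed values, which is one reason "suitable" depends on the data as much as on the CPU.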
Furs



Joined: 04 Mar 2016
Posts: 2346
Furs 12 Aug 2017, 20:26
Hah I kind of like your link, though it doesn't really apply to such scenarios.

"Saving time" usually means you code something as a tool for yourself. But I think nobody does that in asm (?), they probably use a scripting language (not even C/C++) or something along those lines. Well, unless you need the speed that C/C++ provides.

Of course, "saving time" doesn't really apply if you distribute the application and it's not just a "productivity tool". Millions (?) of people will enjoy its performance improvements if you improve it. Also, for some interactive or realtime apps, performance isn't as much about saving time as it is about the app being useable at all. e.g. nobody wants to play a game with 10 FPS instead of 60 FPS (yea, dramatic example, but you get the point), or use a DSP effect with crackles or not even able to preview it in realtime etc.

(obviously talking about the case where the speed difference exists in the first place)

I mean I hate bloat personally but seriously, sometimes I see all these casual apps (like web browsers) taking so damn long to load on some PCs/tablets/whatever or for whatever reason (instead of instant) just because Joe doesn't want to work more to improve the experience of everyone who uses it.

I know the saying "But programmers are far less common than users so their time is more important", but we have too much software that does the same shit, so to me it's a poor excuse. In some cases, alternative software was even born to be "lightweight" compared to another, and then in the end they resort to the exact same bloat! WTF. I'd rather have one insanely optimized piece of software than 5 bloated pieces of crap that all do basically the same thing. If only all those programmers worked on just 1 instead (and gave it enough settings/options to satisfy everyone).

Anyway sorry, off topic rant. :?
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 19874
Location: In your JS exploiting you and your system
revolution 12 Aug 2017, 20:54
Well the point is it is not possible to optimise for everything at once. You have to choose which system you optimise for. It is possible to make many different versions of your app for each different system, but in practice no one does that, except for some very specific software where runtime is absolutely the most important measurement (aside from correctness of course). And the link to time saving still applies for distributed code. Just adjust your "time saved" values across all systems where it runs.

If your app has a really important requirement to be fast then you have to know each system you optimise for; there aren't any shortcuts here. Don't expect to use "best practices" and get good results across the board, because you will be disappointed.

And the problem in your rant about browsers isn't that they fail to use branch prediction hints and other tiny micro adjustments; it has entirely different causes.
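
A minimal sketch of the "different versions for different systems" approach, assuming the choice is made once at startup through a function pointer. All names here (cpu_kind, checksum_generic, checksum_wide, select_checksum) are hypothetical, and how the CPU is actually identified is left out; real code would query the hardware or the OS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*checksum_fn)(const uint8_t *, size_t);

/* Hypothetical classification of the machine the program runs on. */
enum cpu_kind { CPU_GENERIC, CPU_WIDE };

/* Baseline version: one byte per iteration. */
uint32_t checksum_generic(const uint8_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Variant with four independent accumulators, meant for CPUs where
 * that happens to measure faster. It returns the same value. */
uint32_t checksum_wide(const uint8_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; i++)            /* leftover bytes */
        s0 += p[i];
    return s0 + s1 + s2 + s3;
}

/* Picks the implementation once; callers keep the returned pointer. */
checksum_fn select_checksum(enum cpu_kind kind)
{
    switch (kind) {
    case CPU_WIDE:
        return checksum_wide;
    default:
        return checksum_generic;
    }
}
```

Each extra variant has to be written, tested and measured on the system it targets, which is the maintenance cost described above.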
system error



Joined: 01 Sep 2013
Posts: 670
system error 12 Aug 2017, 22:25
Furs wrote:
He's asking demo code for __builtin_expect, but that is not possible. Sorry, i can't write code. All I wanted to do was to impress people with my "essays" and third-party quotes so I look smart and important. No code, no proof. No nothing. I am a Circus Monkey. My job is to entertain and to impress


You don't have to punish yourself like that, broh! xD
vivik



Joined: 29 Oct 2016
Posts: 671
vivik 30 Aug 2017, 09:27
Here is an example of likely/unlikely in action; it seems to mostly rearrange the if/else blocks. Furs said that gcc may also place all frequently called functions closer together, but it's harder to write an example of that. Glad there are people that actually read the gcc docs.

Compiled with this:
g++.exe -S -masm=intel -O3 -fverbose-asm -Wall -std=c++0x -nostdlib -ffreestanding -mconsole -fno-stack-check -fno-stack-protector -mno-stack-arg-probe -fno-inline-functions -fno-exceptions -fno-asynchronous-unwind-tables -c ltalloc.cc

Don't try to actually run this code, by the way. It's only half of what it should be.

I used winmerge to see the difference, because I have no idea how to get a diff on windows otherwise. I think I managed to get a diff in emacs once, but I forgot how.


Attachments:
ltalloc_nolikely.c.txt (15.77 KB)
ltalloc.s.txt (19.4 KB)
ltalloc.c.txt (14.97 KB)

vivik



Joined: 29 Oct 2016
Posts: 671
vivik 30 Aug 2017, 09:28
>Attachment cannot be added, since the max. number of 3 Attachments in this post was achieved


Attachment:
ltalloc_nolikely.s.txt (9.56 KB)

vivik



Joined: 29 Oct 2016
Posts: 671
vivik 03 Sep 2017, 08:30
Would be awesome if gcc generated a warning if likely/unlikely in code contradicted the actual benchmarking results. Wonder if gcc is smart enough for that.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 19874
Location: In your JS exploiting you and your system
revolution 03 Sep 2017, 08:40
vivik wrote:
Would be awesome if gcc generated a warning if likely/unlikely in code contradicted the actual benchmarking results. Wonder if gcc is smart enough for that.
Does GCC do some internal benchmarking of procedures? How would it know the typical input patterns?
vivik



Joined: 29 Oct 2016
Posts: 671
vivik 03 Sep 2017, 09:01
@revolution
Profiling, not benchmarking; I used the wrong word. But yeah, gcc can, optionally. The gcc docs actually recommend using it instead of setting likely/unlikely manually.

CFLAGS_PROFILE=-g -pg -ggdb -fprofile-arcs
CFLAGS_RELEASE_PROFILE=-fbranch-probabilities

I would still like to have direct control over this, but have gcc around to show me when I've made a mistake. To get good profiling hints, I need to execute 100% of the program before recompiling it. That looks troublesome: it requires writing special coverage tests, which will be artificial by nature and won't reflect the real usage of the program.
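
A minimal sketch of the two-pass workflow behind those flags, written as a C file with the build steps in a comment. The file name prog.c and the function count_escapes are hypothetical; the flags are the two from the post, with -O2 added as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Two-pass build (prog.c is a hypothetical program that contains
 * this function plus a main that feeds it real input):
 *
 *   1. gcc -O2 -fprofile-arcs prog.c -o prog
 *        instrumented build; branch counters are compiled in
 *   2. ./prog
 *        run a representative workload; profile data (.gcda) is
 *        written when the program exits
 *   3. gcc -O2 -fbranch-probabilities prog.c -o prog
 *        rebuild, letting GCC use the recorded branch counts
 */

/* Counts the bytes that would need escaping. Which way the branch
 * usually goes depends on the real input, and that is exactly what
 * the profile run records. */
size_t count_escapes(const char *text, size_t len)
{
    assert(text != NULL);
    size_t escapes = 0;
    for (size_t i = 0; i < len; i++) {
        char c = text[i];
        if (c == '"' || c == '\\' || c == '\n')
            escapes++;
    }
    return escapes;
}
```

The profile is only as good as the workload in step 2, which is the concern raised above about artificial coverage tests.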
Furs



Joined: 04 Mar 2016
Posts: 2346
Furs 03 Sep 2017, 11:55
revolution wrote:
Does GCC do some internal benchmarking of procedures? How would it know the typical input patterns?
It sort of does something similar (I know it's not what you asked for), based on the "frequency" with which a function or a basic block gets called/used. Of course this information is limited, since GCC can't know it at compile-time (without hints). I find profile-guided optimizations a pain since you'll have to redo them on every final compilation (since code can change).
vivik



Joined: 29 Oct 2016
Posts: 671
vivik 05 Sep 2017, 09:26
@Furs
Hm, profiling should count every branch (every if and for), not only function calls. I say that because I've seen things like loops in profiler reports.
DimonSoft



Joined: 03 Mar 2010
Posts: 1228
Location: Belarus
DimonSoft 05 Sep 2017, 09:58
system error wrote:
^
^

as usual, ALL TALK. No proof. Wait till that guy above me quoting something from Microsoft / Intel to make him looks "smart" and "important" while he himself remain largely clueless and probably takes 2 full threads to finally understand the simplest thing even a Gorilla can understand! xD

Wasn’t that you in another thread with the same rude and ignorant posts? You have the opposite information—feel free to share the links to the documentation/specifications. You don’t—feel free not to write something useless.
vivik



Joined: 29 Oct 2016
Posts: 671
vivik 05 Sep 2017, 16:38
jesus, do it in private messages

Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube.