flat assembler
Message board for the users of flat assembler.
flat assembler > Main > branch prediction

Furs



Joined: 04 Mar 2016
Posts: 814
So you ask for proof, and you don't want me to link anything? And yet if I tell you something in a post, it's "all talk". So how do you want a proof, exactly?

Also, proof about what? Branch mispredictions? The fact that xor rdx, rdx is one byte longer than xor edx, edx and does the exact same thing? (this one is easy, compile with FASM and see for yourself). That inc/dec are better than add/sub reg,1 because they use one less byte and have better cache utilization? That Intel Compiler emits inc/dec? (see on gcc.godbolt.org for yourself) Or what?
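For reference, here are the encodings behind those size claims (per the Intel SDM; anyone can verify by assembling these with fasm and looking at the output bytes):

```asm
use64
xor     rdx, rdx        ; 48 31 D2   = 3 bytes (needs a REX.W prefix)
xor     edx, edx        ; 31 D2      = 2 bytes (a 32-bit write zeroes the top of RDX anyway)
inc     eax             ; FF C0      = 2 bytes in long mode (the short 40+r form became REX)
add     eax, 1          ; 83 C0 01   = 3 bytes
```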

If you want to read links I can obviously provide some. Wink
Post 10 Aug 2017, 12:00
system error



Joined: 01 Sep 2013
Posts: 667
Incompetent Furs, it's hard to explain things like these to a non-technical guy with a BIG MOUTH like you.

Branch misprediction has little to do with instruction size. In fact, I am intentionally using a LONGER and WIDER T-case with a MUL instruction in my example to prove that it still shows better performance than a short code path taken with a jump.

Proof to your MONKEY FACE, just like your MONKEY BRAIN wants to understand it:


Code:
format elf64 

public main
main:
        mov     rcx,100_000_000
@@:     mov     rax,5           ;test value (use -5 for the F case)
        call    t_biased_function
        sub     rcx,1
        jnz     @b
        ret

;------------------------------------
;In favor of T
;Design: RAX is around +ve value
;------------------------------------
t_biased_function:
        mov     rbx,10
        cmp     rax,0     
        jl      .ok             ;don't take the jump
        xor     edx,edx     ;I changed this to suit your LOW COMPETENCY in understanding the subject.
        mul     rbx
        nop
        nop             
.ok:    inc eax ;I changed this to suit your LOW COMPETENCY in understanding the subject being discussed.               
        ret




The result

Code:
perf stat -B ./testcode
0.21xxxx secs for T (true)
0.26xxxx secs for F (false)



This is despite the facts that:

i) T-case has far more code
ii) T-case has an obvious dependency chain (MUL/ADD)
iii) T-case has a MUL instruction

See, I have to repeat this twice just to make you understand what people are actually discussing here.

See the test results? Does that make any difference to your MONKEY BRAIN?

Hahahaha xD
Post 10 Aug 2017, 12:18
system error



Joined: 01 Sep 2013
Posts: 667
Already 1 hour has passed, and just like always, Incompetent Furs goes into hiding, trying to read as many books as possible to give him the "slightest of idea" of what his BIG MOUTH should write next.

Hello????

hahahaha xD
Post 10 Aug 2017, 12:22
Furs



Joined: 04 Mar 2016
Posts: 814

system error wrote:
Branch Mispredictions has nothing much to do with an instruction size.

Show me where I said it does.

Branches, however, do (but not in this case, since code is too short anyway) -- since a taken branch has to pull code from a different cacheline if it's too far away and possibly out of cache. My point was that your code is awfully bad, not for branch misprediction, but in general. It's bad assembly programming practice. You know it's bad when a HLL compiler will always generate better code than you. Why even code in asm lol?

The "better performance" comes from a branch misprediction because it has to flush the entire pipeline, which is what I already said. Nobody denies that a branch misprediction is bad, wtf. I don't think you understand how slow a branch misprediction is. Look at Agner's Optimization pdf linked earlier by vivik or me (yeah, I/we link stuff, not for you). A mul is nothing in comparison with that. In fact, it gets executed out of order anyway, so who cares?

Oh and maybe if you were active in the Heap you'd realize I have more important shit to do than argue with a braindead child.

Anyway, I'm done arguing with you in general, thankfully this is not PM and vivik knows how clueless you are. If I am to post links, I will do so for other members of this forum, not for you, so there's one thing you can do about it: deal with it and fuck off. Members who are actual programmers not man-children, and who genuinely want to learn. (did I tell you I really hate 10 year old kids? yeah, editing quotes and such is what they do btw; I know you said women can't code, but script kids can't code either Wink)
Post 10 Aug 2017, 12:41
system error



Joined: 01 Sep 2013
Posts: 667

Furs wrote:
Show me where I said it does.



Oops! Incompetent!


Furs wrote:
Also, proof about what? Branch mispredictions? The fact that xor rdx, rdx is one byte longer than xor edx, edx and does the exact same thing?



No monkey! Branch prediction has nothing to do with an instruction that is a byte longer. You're way off topic. Climbing the wrong tree, again?! Hahahaha xD


Quote:
The "better performance" comes from a branch misprediction because it has to flush the entire pipeline, which is what I already said. I don't think you understand how slow a branch misprediction is.



I've proven it twice with code and test results. Two bitch-slappings in your MONKEY FACE! You don't have to quote "agner fog" just to make yourself look "smart" and "important".

You're such a pathetic loser! xD


Quote:
Anyway, I'm done arguing with you in general, thankfully this is not PM and vivik knows how clueless you are. If I am to post links, I will do so for other members of this forum, not for you, so there's one thing you can do about it: fuck off. Members who are actual programmers not man-children, and who genuinely want to learn. (did I tell you I really hate 10 year old kids? yeah, editing quotes and such is what they do btw; I know you said women can't code, but kids can't code either Wink)



blah blah blah again? Are you really that sick?
Post 10 Aug 2017, 12:49
system error



Joined: 01 Sep 2013
Posts: 667
Helloooo ,,, Furs??? Are you there?? It's been one hour already, bro! Are you ok back there?

hahahaha xD
Post 10 Aug 2017, 12:53
vivik



Joined: 29 Oct 2016
Posts: 174
sorry, not gonna read last 3 posts.

I deleted all likely/unlikely from one program, and trying to see what changed in assembly. There are a few new warnings now, so probably the program behaviour changed. Too lazy/tired to dig into this right now.
Post 10 Aug 2017, 12:53
Furs



Joined: 04 Mar 2016
Posts: 814

vivik wrote:
I deleted all likely/unlikely from one program, and trying to see what changed in assembly. There are a few new warnings now, so probably the program behaviour changed. Too lazy/tired to dig into this right now.

Warnings? There shouldn't be any. Would you show the code? (including the likely/unlikely macros -- btw they have to be macros, inline functions won't cut it due to how GCC works, last time I tested on 6.4.0)

Likely/Unlikely deals with how the compiler transforms your code. It's not really useful to analyze the code because it depends on surrounding code, but in general, a "not taken" branch is more likely -- so use that as info if it helps. (this is a CPU thing)

Note that likely/unlikely macros are not the only way to give such hints to the compiler. You can use profile-guided optimization for the same thing.

Lastly, I think unlikely is a very useful macro, contrary to what a certain someone else said. That's because you can use it in all paths leading to error stuff -- things you expect to not happen and even if they do, the result is an error. Such cases will never need performance. For example even if you use a high-performance memory allocator, use "unlikely" on the failure code (i.e. NULL pointer return).

@system error: In all seriousness, I have no idea what you're talking about with that quote. It was a list of questions about what, specifically, you wanted proof of from my post. Each of them was totally unrelated to the others, but each was about better code (in general, not in this case). So w/e, learn some English. Rolling Eyes

Also, it's not branches that are slow, it's conditional branches. Because it's the misprediction that is slow, not the branch itself. Replace it with jmp and see.
Post 10 Aug 2017, 13:26
vivik



Joined: 29 Oct 2016
Posts: 174
@Furs
Don't worry, I fixed it. Should've left the () around the macro.
You can look at this though https://github.com/r-lyeh/ltalloc
I'm reworking this thing at turtle's speed.

Hints on how to comprehend fully optimized gcc output would be helpful.
Post 10 Aug 2017, 13:45
Furs



Joined: 04 Mar 2016
Posts: 814
Interesting project, I like your rationale for it.

Unfortunately, I can't give out "easy" tips to understand GCC's internal optimization passes. The problem is that they change, a lot; sometimes (rarely) they even regress. If using a HLL, all you can do is "hope" the compiler will use the info properly; it's not worth fighting it to generate specific code unless you really want to stick with one particular version of the compiler.

GCC (like any compiler, I think?) uses "basic blocks" for code with branches. A basic block has no branches at all (excluding rare exceptions). What GCC does is mark certain basic blocks as cold or hot -- with various "frequency" variables under the hood.

unlikely() can tell it to mark that BB cold, but a cold function also does it. Cold functions are also placed in a totally separate place in the binary (away from hot functions) to improve locality for hot functions. They are also automatically optimized for size instead of speed. But more importantly, BBs calling cold functions may be marked cold by themselves.

Note that GCC reorders them based on how it thinks is best, not how the CPU actually works -- the architecture-specific portions happen at a much lower level, during the RTL scheduling pass.

You can see how GCC transforms your code by using -fdump-tree-all which dumps GIMPLE (the "higher level" representation, which is architecture-independent). It dumps a LOT of files, each file is the output of a specific pass. GCC runs hundreds of passes for each function. It is C-like code, but you should understand a bit about it (read GCC's internals on GIMPLE, the introduction at least). Note that later GCC passes operate on SSA form -- each variable is only assigned once (and many copies are created). This is to aid optimization passes as they can just walk the SSA uses or definitions (I've personally made a GCC plugin for this, it's not hard when you understand how it works but unfortunately that's the harder part).

The "optimized" GIMPLE pass (that's literally how it is called) shows the final GIMPLE output before it's lowered to RTL. The first RTL pass is "expand" (expands into RTL insns and exprs).

-fdump-rtl-all dumps the RTL passes (LISP-like, with a lot of pointers), which kind of sucks to interpret as it's chaotic (for a non LISP-er like me). It's the lower-level representation and the one before assembly. But anyway, GIMPLE should show you the BBs and how they get transformed I believe. The final RTL pass before assembly is called "final".

BTW When I say "architecture-specific" I mean CPU internal architecture, such as Haswell, etc... not x86. Of course x86-specific stuff happens during the first RTL pass. Each x86 CPU is different though.
Post 10 Aug 2017, 14:14
vivik



Joined: 29 Oct 2016
Posts: 174
It's not my project though; I was just looking for some allocator for Windows. I'm currently trying to understand it.

I'll look into this, timing for that is pretty great. Thanks.
Post 10 Aug 2017, 14:45
system error



Joined: 01 Sep 2013
Posts: 667

Furs wrote:
Also, it's not branches that are slow, it's conditional branches. Because it's the misprediction that is slow, not the branch itself. Replace it with jmp and see.



Of course, Captain Obvious! My test code was all about branch misprediction that can be avoided by applying a biased strategy, and NEVER about instruction length (as you INCOMPETENTLY suggested).

In the end we can be sure that all you did was try to make yourself look "important" and "smart" with lots of quotes, BLAH BLAH and, most certainly, no code.

I also read a comment from you "teaching" Tomasz Grysztar that "xor rdx,rdx" is not optimized and is longer than "xor edx,edx", while it was completely irrelevant. Seriously, are you really that DESPERATE to make yourself look smart and relevant while in fact you're NOT?
Post 10 Aug 2017, 17:24
Furs



Joined: 04 Mar 2016
Posts: 814
^ indeed, so what's your problem? I didn't post in this thread to treat you "in a special way" even though you seem to have "special needs". Yes, it was unrelated, but it's an assembly forum, so what do you expect? Notice how other people also tend to post tips or "small optimization tips" like these (including missed tail-call optimizations and the like).

I mean, did I look like I wanted an argument? Are you mad about facts or something? xor edx, edx is better (and HLLs are even better here) and that applies regardless of who uses xor rdx, rdx -- Tomasz or you or someone else. I don't appeal to authority, but at least Tomasz is respectable since he doesn't throw fits when he makes a mistake. Why are you so fucking mad over it? Get a grip.

I even re-read my post right now, and nowhere was I even offensive in it; I only pointed out facts that you seem so mad about for no fucking reason.

The only excuse you could have is that I said the sub/add reg,1 is "stupid antiquated practice", which you could misinterpret (if your English is that bad) as calling you stupid. I mean, I even said "please use inc/dec", is that not polite enough for a retard like yourself?

The reason I said the add/sub thing is because I'm sick of people online (on stackoverflow especially, we don't all visit just this forum you know) spreading misinformation about partial flag stalls when they only apply to the P4. I'm tired of this misinformation being spread around, so I point it out like that. Those people, though, are eager to accept they are wrong -- the problem is that old books and stuff like that are what they learn from. It's not an argument over there (they happily accept being proven wrong and educated), but it's just too often a mistake.

I mean, dude, it's assembly. The whole point of it is to generate better code than what a HLL would. Don't teach bad assembly programming practice, that's all.

(GCC is the only compiler that still uses add/sub reg,1 because it still targets the P4 and they're too lazy to change the pattern matcher; anyway, some of the GCC folks' decisions are questionable. Obviously the Intel Compiler is made by Intel and they know their processors better -- I've personally developed my own GCC plugin to make tweaks like these, including replacing add/sub with inc/dec)


Anyway -- back to topic now, I never intended to argue with you in the first place (heck I didn't even know it was you who posted it initially) -- so have fun.
Post 10 Aug 2017, 20:20
system error



Joined: 01 Sep 2013
Posts: 667
@incompetent Furs

No monkey. People here are not talking about instruction length, and neither was the thread where you tried to teach Tomasz about instruction length when his main focus was 128-bit integers and the algorithm! Not relevant at all. See what your real problem is now? You tried so hard to look 'smart' by squeezing into highly-technical discussions and offering pathetic advice completely irrelevant to the subject matter.

So when adults talk, a circus monkey like you should just stand down so that you don't take attention away from useful information being presented by real experts like Tomasz Grysztar. You are actually a sick person with very serious narcissism issues. Worse yet, you have nothing useful to offer, no code to prove anything. All you do is monkey around under the pretense of being an 'expert' or somebody important, quoting Microsoft and Intel + your garbage rantings. xD
Post 10 Aug 2017, 23:16
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15152
Location: GW170817

Furs wrote:
Note that most CPUs assume the branch by default is not taken.

I think it depends upon the branch direction. Forward branches with no entry in the BTB are predicted as not taken, and backward branches as taken. Well, that is how the rules used to be, perhaps newer CPUs have changed this behaviour?

Anyhow, as usual, a proper set of tests in the actual code would tell us if it makes any measurable difference. It is easy to make up some specific code example that shows a major difference, but that tells us nothing about what happens in the production code.
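To make the static rule concrete, here is a hedged layout sketch (labels and registers are arbitrary): put the rare case behind a forward branch so the common case falls through, and let the loop's backward branch be the taken one:

```asm
        test    rax, rax
        jz      .rare           ; forward, no BTB entry: statically
                                ; predicted NOT taken, so the common
                                ; case falls through
        ; ... hot path here ...
.loop:
        dec     rcx
        jnz     .loop           ; backward: statically predicted taken,
                                ; which suits a loop iterating many times
.rare:
        ; ... uncommon handling, out of the hot path ...
```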
Post 11 Aug 2017, 03:58
vivik



Joined: 29 Oct 2016
Posts: 174
On other forums creating a shitstorm in threads like this is bannable; surprised it's okay here.

So, can somebody show real code examples? You seem like you have time on your hands.
Post 11 Aug 2017, 10:01
Furs



Joined: 04 Mar 2016
Posts: 814

revolution wrote:
I think it depends upon the branch direction. Forward branches with no entry in the BTB are predicted as not taken, and backward branches as taken. Well, that is how the rules used to be, perhaps newer CPUs have changed this behaviour?

Yeah I forgot that part, you're right, but AFAIK it depends on the macro-fused combinations (typically cmp/jcc, test/jcc and probably dec/jcc) since they're mostly used in "loops". (actually I don't know about test reg, reg, but it probably uses the same micro-op as a cmp with zero). So it's best to assume backwards jumps are taken by default.

@vivik: Real code examples of? Just keep as a rule of thumb that forward conditional branches are assumed (by default) not to be taken, so hot paths should be "fall through". And for backwards branches (or at least close ones, < 127 bytes?) I think it assumes they are taken and don't fall through (i.e. loops). Especially if preceded by cmp/test/dec or similar, because those combinations of insns get macro-fused into one micro-op.

The problem is that "I think" is the best advice I can give, because many CPUs are different, so it's best to just stick to "best coding practice" for this reason.

I don't agree with revolution's testing for one reason: I believe in good coding practice more than tests, unless you really want to target one particular CPU. For example, AMD and Intel have vastly different CPUs. Testing on Intel doesn't mean it runs well on AMD, and vice-versa. Even different Intel CPUs behave differently if their microarchitectures differ enough.

So while tests are indeed good in a way, it's not a feasible approach for programs we intend to send "into the wild" IMO Confused One easy example: the loop insn is very, very slow on Intel CPUs (still useful for size optimizations) but it's pretty damn fast on AMD CPUs (still slower than a macro-fused conditional branch, but it's only 2 micro-ops, not 9 like on Intel). j[er]cxz is actually fast and the best way if you want flags preserved -- but it's not macro-fused, so it also uses 2 micro-ops internally. test ecx, ecx / jxx is one micro-op since it's fused, but it destroys the flags. By AMD CPUs I also mean Ryzen of course. (source: Agner's Instruction Tables pdf in the link)

Also keep in mind that the more consistently a branch is taken or not taken, the more strongly the CPU will predict the same outcome next time. So even if something is mispredicted the first time, as long as you know it will be taken 99% of the time it should be fine. The CPU will "figure it out" soon enough.


EDIT: Also, yeah, this board is too lenient -- you can see since system error is still around. I've insulted him my fair share in the past, but that's after at least 10 or more of his insulting posts with his childish tantrums. I don't care about his nonsense anymore, but if he gets banned I won't lie I'll feel better (even if just temporarily) just so he can learn to behave, not because of myself. If this child was on stackoverflow he'd get instantly banned.

For example, @system error, go create a shitstorm here too: https://board.flatassembler.net/topic.php?t=19651 since you even posted in the thread -- for that guy offering an "unrelated" optimization tip. What's your problem then??!? Hypocrite.

Those "unrelated" optimization tips are useful for beginners who browse the board. On the other hand, your childish bullshit is pure noise nobody cares about, so keep it to yourself. Those unrelated optimization tips are also allowed on stackoverflow, but I dare you to post your bullshit there and let's see what happens.
Post 11 Aug 2017, 11:31
vivik



Joined: 29 Oct 2016
Posts: 174
Can you stop feeding? You're just as bad. Call him a faggot once and stop conversation, it's pretty useless anyway.

By code example I mean: it would be great to see how __builtin_expect actually affects the generated assembly. It's surprisingly hard to do; -O3 managed to optimize lots of the ifs away. This guy http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html added a lot of weird calculations in, just to keep the compiler from optimizing everything away.

Interestingly, one of the comments reported that a newer version of gcc isn't affected by those at all. And I got a similar result: correct and incorrect prediction resulted in somewhat similar times.

I'm working on simplifying ltalloc to the point where I can find the interesting parts of the assembly; it will take some time.
Post 11 Aug 2017, 14:51
Furs



Joined: 04 Mar 2016
Posts: 814
But __builtin_expect is not mainly about branch prediction, as I said. Yes, it's very difficult to see, especially because it differs between GCC versions, so what's the point of trying to understand it in detail?

I don't know exactly how __builtin_expect works, but I know it is used to transform your code, since there are plenty of functions in GCC's source code that deal with "frequencies" of basic blocks (i.e. how frequently such a basic block is supposed to be visited; an arbitrary GCC-internal metric). "Transforming your code" does not mean at the instruction level. For example, GIMPLE has nothing to do with any ISA, so all transformations are done at an abstract high level (this is also where most ifs get removed). RTL is less abstract and does more arch-specific optimizations, but I don't think any of them have to do with "explicit" branch prediction.

In general I think GCC reorders code so that it places hot paths near each other to improve code locality (I guess my "shorter insns" thing can be helpful with this idea), just like it places hot functions near each other (in same .text.hot section, while it places cold functions into .text.unlikely). __builtin_expect is not necessarily about branch prediction by itself; like I said many other things label basic blocks as "hot" in GCC.

Of course, it also re-orders code to follow the typical branch prediction assumptions -- forward jumps have hot path as fall-through, backward jumps have hot path as taken. This is not always the case though since in some cases it is literally impossible -- it's possible it simply "votes" each basic block depending on its frequency. (I know the register allocator uses votes for each register class)

Why are you so interested in __builtin_expect from an asm point of view? It's a high level construct used to help the compiler generate better code from HLL. You have to understand that the "flow of control" of a HLL is absolutely not the same as asm. I mean logically it is the same but not under-the-hood. Nothing requires GCC (or any compiler) to put the basic blocks in the same order you see them in a HLL. Even gotos could be moved. A goto could be inlined with its destination and then the original "fall through" case actually jump to that goto! (most of this is done in GIMPLE)

As a matter of fact, you can even mark LABELS as hot by yourself, if they are called from something GCC is unaware of, such as inline asm (GCC does not parse the asm at all, only the operands).

Of course, such label is jumped from the asm (you use "asm goto" for it) and thus GCC cannot possibly affect the branch prediction at all (can't change your asm, since it doesn't parse it). However, this does imply that a "hot" basic block is way more than just about branch prediction. In a large switch statement, it could place all hot basic blocks (labels) near each other, again to improve locality.
Post 11 Aug 2017, 17:06
Powered by phpBB © 2001-2005 phpBB Group.

Copyright © 2004-2016, Tomasz Grysztar.