flat assembler
Message board for the users of flat assembler.

cost of branch mis-prediction

sylware 03 Mar 2026, 10:58
This is related to one of my previous posts.

What is the cost of a branch mis-prediction on modern micro-architectures?

I have been searching the web, but so far I have not found anything conclusive.

I ask because crafting code that is friendly to branch prediction (using THE static rule) usually means more machine instructions, often unconditional branches.

Could the cost of a branch mis-prediction (which is sure to happen for a conditional branch encountered for the first time, which is not yet in the BHT/BTB and goes against THE static rule) be cheaper than the additional machine instructions?
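
For readers working from C rather than assembly, the layout question above can be sketched with a compiler hint. This is only an illustration: `UNLIKELY` and `sum_valid` are hypothetical names, `__builtin_expect` is a GCC/Clang builtin, and whether the rare path actually ends up out of line depends on the compiler and optimization level, so the generated assembly should be inspected.

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang builtin; other compilers may need a different spelling. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: sum the non-negative values and count the
 * (assumed rare) negative ones as errors. */
long sum_valid(const int *v, size_t n, size_t *errors)
{
    long sum = 0;

    assert(v != NULL && errors != NULL);
    *errors = 0;

    /* The loop back-edge is a backward branch that is usually taken. */
    for (size_t i = 0; i < n; i++) {
        /* Rare path: hinted so the compiler can place it out of line. */
        if (UNLIKELY(v[i] < 0)) {
            (*errors)++;
            continue;
        }
        /* Common path: intended to be the fall-through. */
        sum += v[i];
    }
    return sum;
}
```

Calling `sum_valid` on `{1, 2, -3, 4}` returns 7 and reports one error; the hint changes only the code layout the compiler may choose, not the result.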

EDIT: after further reading, it seems we are talking about a penalty of about 20 clock cycles, roughly 3-5 ns on a modern CPU core at full clock speed. In other words, you would need a ton of additional unconditional branches before the prediction-friendly layout becomes the riskier/less interesting choice. But dynamic CPU clocking could stay low with efficient code and go higher with inefficient code, so if you really want a fast path to be really fast, better tell your OS scheduler that this thread really wants all the GHz.
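
The arithmetic behind this estimate can be sketched as below. The 20-cycle figure comes from the post; the function names, the linear cost model, and any per-instruction cost you plug in are assumptions, and the model ignores how out-of-order execution can hide or amplify either cost.

```c
#include <assert.h>

/* Convert a penalty in cycles to nanoseconds.
 * A clock of N GHz executes N cycles per nanosecond. */
double cycles_to_ns(double cycles, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return cycles / clock_ghz;
}

/* Average cycles lost per execution of a branch that is
 * mispredicted with probability miss_rate. */
double expected_penalty_cycles(double miss_rate, double penalty_cycles)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return miss_rate * penalty_cycles;
}

/* Returns 1 when the expected misprediction cost exceeds the
 * (assumed) cost of the extra instructions added to avoid it. */
int restructuring_pays_off(double miss_rate, double penalty_cycles,
                           double extra_cycles)
{
    return expected_penalty_cycles(miss_rate, penalty_cycles) > extra_cycles;
}
```

Under these assumptions, `cycles_to_ns(20.0, 4.0)` gives 5 ns at 4 GHz. If an extra unconditional branch is assumed to cost about 1 cycle, a branch that is mispredicted half the time loses 10 cycles on average and restructuring pays off, while at a 1% miss rate (0.2 cycles on average) it does not.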
anbyte 03 Mar 2026, 12:45
Someone recently told me about the book “Algorithms for Modern Hardware” by Sergey Slotin, which has a chapter covering branching. It seems very comprehensive, and it has information on a lot of other optimization concepts for modern architectures (and since we're talking about performance engineering, it's worth mentioning the Computer, Enhance! course by Casey Muratori).
These seem like good-quality resources, so I thought I'd share them for anyone interested.
AsmGuru62 03 Mar 2026, 13:05
I think it would be a good project to do: measure the performance impact of the "predicted" path vs. the "mispredicted" one.
But that will not be easy: how do you avoid the impact of other instructions?
The book by Slotin is indeed a very nice one.
revolution 03 Mar 2026, 13:30
sylware wrote:
What is the cost of a branch mis-prediction on modern micro-architectures?
It depends on the CPU and on the internal state it is holding at the time of execution, so anything from a few cycles to many hundreds of cycles.
AsmGuru62 wrote:
... how to avoid the impact of other instructions?
In OoO CPUs, instructions don't execute in isolation, and the timing of a single instruction isn't guaranteed to be a fixed value. It is possible to set up special test code that forces things to flow in certain patterns and gives timings, but those values are useless for anything other than that code. At best they are a general indication.
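
A test setup of the kind described above might look like the following C sketch. Everything in it is an assumption made for illustration: the helper names, the xorshift generator, the use of `clock()` for timing, and the `volatile` counter, which is intended to stop the compiler from replacing the branch with branch-free code. As noted, any number it produces is an indication for this loop only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Small xorshift PRNG so the "random" input is reproducible. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Fill buf with 0/1 flags: all ones (easy to predict) when random
 * is 0, pseudo-random (hard to predict) otherwise. */
void fill_flags(unsigned char *buf, size_t n, int random)
{
    uint32_t state = 2463534242u;
    for (size_t i = 0; i < n; i++)
        buf[i] = random ? (unsigned char)((xorshift32(&state) >> 16) & 1u)
                        : (unsigned char)1;
}

/* Count set flags using a data-dependent conditional branch. */
size_t count_taken(const unsigned char *buf, size_t n)
{
    volatile size_t taken = 0; /* volatile: keep the conditional update */
    for (size_t i = 0; i < n; i++) {
        if (buf[i])
            taken = taken + 1;
    }
    return taken;
}

/* Run `reps` passes over buf; returns the CPU time in seconds and
 * stores the count from the last pass in *count_out. */
double time_passes(const unsigned char *buf, size_t n, int reps,
                   size_t *count_out)
{
    assert(buf != NULL && count_out != NULL && reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        *count_out = count_taken(buf, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To use it, fill one large buffer with `fill_flags(buf, n, 0)` and another with `fill_flags(buf, n, 1)`, then compare the two `time_passes` results. The loop body is identical in both runs, so the difference is mostly attributable to mispredictions, but a short pseudo-random buffer repeated many times may be partly learned by the predictor, so `n` should be large.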
sylware 03 Mar 2026, 13:50
A bit off-topic:

oooof!

CPU power management is now, by default, very coarsely driven by the OS.

There is a risk that "efficient" code paths do not trigger a CPU clock increase while "inefficient" code paths are more likely to trigger one... making the "inefficient" code run _faster_ than the "efficient" code!

Scary.

Back on topic: "measuring" would be useful to increase the chance of detecting that something is really going wrong, but with all those hardware bugs (for instance SLS), "measuring" may show really weird things... (I guess this is how the speculative execution exploits were discovered).

That 20-clock-cycle penalty seems to be the "norm" on modern micro-architectures.

The bottom line: the CPU designers made assumptions about the general design of machine code programs, and the most important thing is to respect the assumptions common to all micro-architectures, for instance THE static branch prediction rule. Of course, one can push micro-architecture specialization of machine code at runtime or compile time (very true for RISC-V CPUs with very fine-grained hardware control extensions).