flat assembler
Message board for the users of flat assembler.

Index > Main > loops and branch prediction

sylware



Joined: 23 Oct 2020
Posts: 543
Location: Marseille/France
sylware 25 Feb 2026, 14:53
On current large micro-architectures, what is the least-bad compromise when designing a loop? Keep the whole loop control as local as possible? And if so, before or after the loop's inner code? Because you have to choose to fall through either into the loop's inner code or into the code following the actual loop.

Or something completely different?
AsmGuru62



Joined: 28 Jan 2004
Posts: 1781
Location: Toronto, Canada
AsmGuru62 25 Feb 2026, 16:24
The manuals mention these few things:
- the loop's first opcode should be aligned to 16 bytes;
- the loop body should be small enough to fit into some mysterious cache;
- if there is a conditional branch inside the loop, in most cases the code should fall through it.

That is all.
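A minimal sketch putting those three rules together (the register use and exit condition here are made up for illustration, not taken from any manual):
Code:
        align 16                ; rule 1: loop's first opcode on a 16-byte boundary
.loop:                          ; rule 2: keep the body short
        mov al, [rsi]
        inc rsi
        test al, al
        jz .done                ; rule 3: conditional branch falls through on most iterations
        dec rcx
        jnz .loop               ; backward branch, taken until the count runs out
.done: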
It would be interesting to test and measure loops: aligned vs misaligned, large body vs small body.

Also interesting: FASM's code has almost no alignment on branches, yet it is crazy fast!
Very cool!
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20896
Location: In your JS exploiting you and your system
revolution 25 Feb 2026, 16:45
fasm is mostly data-dependent, not CPU-dependent, so tiny details like loop and branch alignment will be unnoticeable: they are swamped by the data accesses, which take up all the cycles.


Last edited by revolution on 25 Feb 2026, 22:15; edited 1 time in total
bitRAKE



Joined: 21 Jul 2003
Posts: 4392
Location: vpcmpistri
bitRAKE 25 Feb 2026, 21:01
TL;DR: the key improvement is reducing front-end disruption (BTB/prediction/fetch/decode/op-cache pressure) while keeping the hot path compact and predictable.

To compact and align the inner loop, I typically do something like:
Code:
        jmp .try
        align 16                ; or 64: limit cache-line use of the hot loop
.loop:
        ; ... loop body ...
.try:
        cmp REG_scan, REG_limit ; registers only: reduce memory access in the hot loop
        jc .loop
... this pattern allows bypassing the body on entry while keeping the common case a single predicted backward jump.

Examine what instruction fusion is supported by your targets. Typically, the conditional branch fuses with the flag-setting instruction (macro-fusion). If you have internal exit conditions, make them branch forward (statically predicted not-taken from the start).
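For instance, a fused pair versus a broken one (a sketch; which pairs fuse varies by micro-architecture, so check your target's manual):
Code:
        ; flag setter adjacent to the branch: can macro-fuse into one op
        dec ecx
        jnz .loop

        ; an instruction between the pair defeats the fusion
        dec ecx
        mov eax, [rsi]
        jnz .loop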

The most modern processors can cache post-decode ops and have special handling for very small loops. BTB limitations can prevent prediction when several branches sit tightly together (e.g.
Code:
        test al, 11000000b
        js .1_                  ; bit 7 set
        jnz .01                 ; bit 6 set
.00:
... a pattern we might use for size coding, but it is extra bad for performance).

AMD and Intel publish optimization manuals; they are worth a read. Compiler source code also covers a lot of nuance if you want to squeeze out the top performance. Also study well-tuned libraries.

SIMD execution and memory access often dwarf branch-logic optimization.

Post a pattern you are curious about - I'm sure there are plenty here that could comment on arrangement.
sylware



Joined: 23 Oct 2020
Posts: 543
Location: Marseille/France
sylware 26 Feb 2026, 01:35
@bitRAKE, you prefer the loop control block after the loop body, with a conditional branch that is taken most of the time (to jump back to the loop body). But @AsmGuru62 pointed out it is better to fall through into the loop body: the conditional branch would then be taken only once, to exit the loop, so the loop control block would sit before the loop body.

meh.
bitRAKE



Joined: 21 Jul 2003
Posts: 4392
Location: vpcmpistri
bitRAKE 26 Feb 2026, 02:56
There is more nuance, though. If the loop may need to run zero times, then we need either another test prior to the loop, or a jump to the condition. One arrangement doesn't cover all occurrences.
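The two usual arrangements, sketched with a placeholder counter register REG_count:
Code:
        ; option 1: guard test, then a bottom-tested loop
        test REG_count, REG_count
        jz .skip
.loop:
        ; ... body ...
        dec REG_count
        jnz .loop
.skip:

        ; option 2: enter at the condition, no extra test
        jmp .cond
.loop2:
        ; ... body ...
        dec REG_count
.cond:
        test REG_count, REG_count
        jnz .loop2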

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.
sylware



Joined: 23 Oct 2020
Posts: 543
Location: Marseille/France
sylware 26 Feb 2026, 12:15
Yeah, and for many loops you can remove the 'empty' test, because you know from the semantics of the program that the loop runs at least once.

Alright, this is from Wikipedia; not x86_64 but RISC-V (a sweet spot for modern CPU micro-architectures):

"The RISC-V ISA recommends that software written for RISC-V harts (hardware threads), or generated to run on RISC-V harts, be optimized with the assumption that backward branches are taken and forward branches are not. (Even when the processor incorporates a more-advanced predictor that diminishes the relative value of static prediction.)"


EDIT: this rule seems pretty much mandatory to follow as closely as possible: after further reading, it appears to be the default rule applied to any branch that is not in the BTB, whatever the branch prediction implementation.


This means, as a good general rule, exiting the loop should be a 'forward' branch and jumping back to the loop body entry should be a 'backward' one. Namely, even with a basic branch predictor, you can put the loop control block before or after the body; you can even split it into a loop control header and a loop control trailer.

EDIT: in the case of nested loops, to satisfy those 'static' branch prediction rules as far as possible, it is better to start crafting from the deepest loop.
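A sketch of a nested loop obeying those static rules, crafted from the deepest loop outward (register names are placeholders):
Code:
.outer:
        ; ... outer setup ...
.inner:
        ; ... hot inner body, falls through ...
        dec REG_i
        jnz .inner              ; backward: statically predicted taken, correct on most iterations
        ; ... outer bookkeeping ...
        dec REG_j
        jnz .outer              ; backward: statically predicted taken
        ; fall through here on final exit: the forward/not-taken direction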
Furs



Joined: 04 Mar 2016
Posts: 2720
Furs 02 Mar 2026, 16:30
sylware wrote:
Yeah, and for many loops you can remove the 'empty' test, because you know from the semantics of the program that the loop runs at least once.

Alright, this is from Wikipedia; not x86_64 but RISC-V (a sweet spot for modern CPU micro-architectures):

"The RISC-V ISA recommends that software written for RISC-V harts (hardware threads), or generated to run on RISC-V harts, be optimized with the assumption that backward branches are taken and forward branches are not. (Even when the processor incorporates a more-advanced predictor that diminishes the relative value of static prediction.)"


EDIT: this rule seems pretty much mandatory to follow as closely as possible: after further reading, it appears to be the default rule applied to any branch that is not in the BTB, whatever the branch prediction implementation.


This means, as a good general rule, exiting the loop should be a 'forward' branch and jumping back to the loop body entry should be a 'backward' one. Namely, even with a basic branch predictor, you can put the loop control block before or after the body; you can even split it into a loop control header and a loop control trailer.

EDIT: in the case of nested loops, to satisfy those 'static' branch prediction rules as far as possible, it is better to start crafting from the deepest loop.
Such rules are always important to follow because, contrary to what revolution thinks, you don't build your app for one specific CPU (I mean the exact model, not the architecture), so using "testing" on one specific CPU to decide is irrelevant. Even if one CPU has this backwards, it's the CPU that's broken and should be fixed, not the other way around (and in many cases that's exactly what happens).

"Broken" in this sense because it's not what most apps expect or are built for.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20896
Location: In your JS exploiting you and your system
revolution 02 Mar 2026, 20:23
I've always said that tests should be conducted on all the systems one is targeting. Then one can weigh and average the results to get a basis for selecting code paths.

If one generates a single code package, some systems will necessarily run less optimally (maybe all of them will), but overall it might work out better. And without testing one can never know how much performance is being gained or lost.

HLL code generators necessarily have to guess and use heuristics to output "fast" code. General rules are fine and will perform adequately, but they leave a lot of room for improvement. As an assembly coder one can do much better and selectively optimise and specialise the code. Peak performance requires specialisation for each system.

It's great to have general "rules", but are they really better for the most common case? What even is the most common case? Every case has different needs. Each new CPU means new rules. How long until the new CPUs become the common case? Should the code be updated now, while new CPUs are rare, or later, or never?

BTW: I love talking about optimisation; it's been my job for the last 26 years. :P Please keep it up, I learn lots from everyone here.
sylware



Joined: 23 Oct 2020
Posts: 543
Location: Marseille/France
sylware 03 Mar 2026, 01:03
I got the same rule from the Intel manuals.

Basically, I got the same "rule" from an R&D paper and the Intel and RISC-V manuals, and I think I read the same thing for AMD.

This is quite generic, of course; it would be better to test it on various micro-architectures.

Each micro-architecture often has specific improvements on top of that rule.

For the moment, with this rule, checking a syscall/external-call return value gets very branchy, but the main path gets fully speculatively executed, and you have to put most checks right after the syscall/external call. It adds so many unconditional branches that I wonder whether taking a mispredict and pipeline flush would not be better.
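One way to keep the main path straight is to test right after the call and branch forward to a cold block, which jumps back once it has handled the error; the hot path then has no taken branches, at the price of exactly those extra unconditional jumps (a sketch, not from any particular manual):
Code:
        syscall
        test rax, rax
        js .fail                ; forward: statically predicted not-taken
.resume:
        ; ... main path, fully fall-through ...

        ; cold section, placed out of line:
.fail:
        ; ... handle or report the error ...
        jmp .resume             ; the extra unconditional branch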
Copyright © 1999-2026, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.