flat assembler
Message board for the users of flat assembler.

loops and branch prediction

sylware 25 Feb 2026, 14:53
On current large micro-architectures, what is the least bad compromise when designing a loop? Keep the whole loop control as local as possible? And if so, before or after the loop body? You have to choose whether to fall through into the loop body or into the code following the loop.

Or something completely different?



AsmGuru62 25 Feb 2026, 16:24
The manuals mention these few things:
- the first instruction of a loop should be aligned to 16 bytes.
- the loop should be short enough to fit into some mysterious cache.
- if there is a conditional branch inside the loop, in most cases the code should fall through it (the branch should usually not be taken).

That is all.
It would be interesting to test and measure loops: aligned vs misaligned, large body vs small body.

Also interesting: the FASM source has almost no alignment on branch targets, yet it is crazy fast!
Very cool!


revolution 25 Feb 2026, 16:45
fasm is mostly data dependent, not CPU dependent, so tiny details like loop and branch alignment will be unnoticeable: they are swamped by the data accesses, which take up nearly all the cycles.


Last edited by revolution on 25 Feb 2026, 22:15; edited 1 time in total



bitRAKE 25 Feb 2026, 21:01
TL;DR: The key improvement is reducing front-end disruption (BTB/prediction/fetch/decode/op-cache pressure) while keeping the hot path compact and predictable.

To compact and align the inner loop, I typically do something like:
Code:
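        jmp .try                ; enter at the test, so zero iterations are possible
        align 16                ; or 64: reduce cache line use of hot loop
.loop:
        ; ... loop body ...

.try:
        cmp REG_scan, REG_limit ; reduce memory access in hot loop
        jc .loop                ; backward branch, taken while REG_scan < REG_limit (unsigned)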
... this pattern allows bypassing the loop entirely (zero iterations) and keeps the common, well-predicted backward jump.

Check which kinds of instruction fusion your targets support. Typically, the conditional jump fuses with the preceding flag-setting instruction. If you have internal exit conditions, make them branch forward (not predicted taken from the start).
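
A minimal sketch of such a forward-branching internal exit, in fasm syntax. The routine name find_zero and its register interface are hypothetical (not from this thread); it assumes a byte scan over the range [rsi, rdi) and assembles on its own as a flat binary:

```asm
use64
; Hypothetical routine: find the first zero byte in [rsi, rdi).
; In:  rsi = start address, rdi = limit (one past the last byte)
; Out: rax = address of the first zero byte, or rdi if there is none
find_zero:
        mov     rax, rsi
        jmp     .try                    ; enter at the test: an empty range runs zero iterations
        align   16                      ; padding is jumped over, never executed
.loop:
        cmp     byte [rax], 0
        je      .done                   ; internal exit: forward branch, taken at most once
        inc     rax
.try:
        cmp     rax, rdi
        jb      .loop                   ; back edge: taken on every iteration but the last
.done:
        ret
```

The only backward branch is the loop's back edge; both ways out of the loop land on .done, which sits after the loop.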

Most modern processors can cache decoded micro-ops and have special handling for very small loops. BTB limitations hamper prediction when several branches are packed tightly together (e.g.
Code:
        test al, 11000000b
        js .1_
        jnz .01
.00:    
... which we might write when coding for size, is extra bad for performance).

AMD and Intel publish optimization manuals; they are worth a read. Compiler source code also covers a lot of nuance if you want to squeeze out top performance. Also study well-tuned libraries.

SIMD execution and memory access often dwarf branching logic optimization.

Post a pattern you are curious about. I'm sure there are plenty of people here who could comment on the arrangement.



sylware 26 Feb 2026, 01:35
@bitRAKE, you prefer the loop control block after the loop body, with a conditional branch that is taken most of the time (to jump back to the loop body). But @AsmGuru62 pointed out that it is better to fall through into the loop body: the conditional branch would then be taken only once, to exit the loop, and the loop control block would sit before the loop body.

meh.



bitRAKE 26 Feb 2026, 02:56
There is more nuance, though. If the loop may need to run zero times, then we need either another test prior to the loop or a jump to the condition. One arrangement doesn't cover all cases.
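
The "another test prior to the loop" option could look like the sketch below (the "jump to the condition" option is the jmp .try pattern posted earlier in the thread). The routine name sum_bytes and its register interface are hypothetical, chosen only for illustration:

```asm
use64
; Hypothetical routine: sum rcx unsigned bytes starting at rsi.
; In:  rsi = buffer address, rcx = byte count (may be zero)
; Out: rax = sum; rsi, rcx and rdx are clobbered
sum_bytes:
        xor     eax, eax                ; rax = 0 (a 32-bit write zero-extends)
        test    rcx, rcx
        jz      .done                   ; guard: forward branch, covers the zero-iteration case
.loop:
        movzx   edx, byte [rsi]
        add     rax, rdx
        inc     rsi
        dec     rcx
        jnz     .loop                   ; back edge: taken until the count reaches zero
.done:
        ret
```

When the program guarantees at least one iteration, the test/jz pair can simply be dropped and the rest stays as is.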

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.



sylware 26 Feb 2026, 12:15
Yeah, and for many loops you can remove the 'empty' test, because you know from the semantics of the program that there is at least one iteration to go through.

All right, this is from Wikipedia, about RISC-V rather than x86_64 (a sweet spot for modern CPU micro-architectures):

"The RISC-V ISA recommends that software written for RISC-V harts (hardware threads), or generated to run on RISC-V harts, be optimized with the assumption that backward branches are taken and forward branches are not. (Even when the processor incorporates a more-advanced predictor that diminishes the relative value of static prediction.)"

This means that, as a good general rule, exiting the loop should be a 'forward' branch and looping back to the loop body entry should be a 'backward' branch. So even with a basic branch predictor, you can put the loop control block before or after the body; you can even split it into a loop control header and a loop control trailer.
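
For illustration, here is a minimal x86-64 sketch in fasm syntax of a loop whose control block sits before the body and still fits that static rule. The routine name strlen_top and its register interface are hypothetical:

```asm
use64
; Hypothetical routine: length of a zero-terminated byte string.
; In:  rsi = string address
; Out: rax = number of bytes before the terminator
strlen_top:
        xor     eax, eax                ; rax = 0
.loop:
        cmp     byte [rsi+rax], 0       ; loop control header: test before the body
        je      .done                   ; forward exit: not taken while the loop runs
        inc     rax                     ; loop body
        jmp     .loop                   ; unconditional backward jump
.done:
        ret
```

The cost of this shape is an extra unconditional jump on every iteration, which the bottom-tested arrangement avoids.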

EDIT:

In the case of nested loops, to fit those 'static' branch prediction rules as closely as possible, it is better to start crafting from the innermost loop.

Copyright © 1999-2026, Tomasz Grysztar.
