flat assembler
Message board for the users of flat assembler.
loops and branch prediction
AsmGuru62 25 Feb 2026, 16:24
The manuals mention these few things:
- The first opcode of a loop should be aligned to 16.
- The loop should be short enough to fit into some mysterious cache.
- If there is a conditional branch in the loop, in most cases the code should fall through it.
That is all. It would be interesting to test and measure loops: aligned vs misaligned, large body vs small body. Also interesting: FASM code has almost no alignment on branches, yet it is crazy fast! Very cool!
revolution 25 Feb 2026, 16:45
fasm is mostly data-dependent, not CPU-dependent, so tiny details like loop and branch alignment will be unnoticeable: they are swamped by the data accesses taking up all the cycles.
Last edited by revolution on 25 Feb 2026, 22:15; edited 1 time in total
bitRAKE 25 Feb 2026, 21:01
TL;DR: The key improvement is reducing front-end disruption (BTB/prediction/fetch/decode/op-cache pressure) while keeping the hot path compact and predictable.
To compact and align the inner loop, I typically do something like:
Code:
        jmp     .try
        align   16                      ; or 64; reduce cache line use of hot loop
.loop:
.try:
        cmp     REG_scan, REG_limit     ; reduce memory access in hot loop
        jc      .loop
Examine what instruction fusion is supported by your targets. Typically, the conditional instruction fuses with the flag-setting instruction. If you have internal exit conditions, make them branch forward (unpredicted from the start). Most modern processors can cache post-decode ops and have special handling for very small loops. BTB limitations can prevent prediction when several branches sit tightly together, e.g.:
Code:
        test    al, 11000000b
        js      .1_
        jnz     .01
.00:
AMD/Intel publish optimization manuals - they are worth a read. Compiler source code also covers a lot of nuance if you want to squeeze out top performance. Also study well-tuned libraries. SIMD execution and memory access often dwarf branching logic optimization. Post a pattern you are curious about - I'm sure there are plenty here who could comment on the arrangement.
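A minimal, complete sketch of the "jump to the condition" layout described above, for anyone who wants something to assemble and measure. Everything beyond the `jmp .try` / `align` / `cmp` / `jc` skeleton is an assumption made for illustration: the Linux x86-64 target, the choice of `rsi`/`rdi` as scan and limit registers, the byte-summing loop body, and the `buffer` data.

```asm
format ELF64 executable 3               ; assumption: Linux x86-64 target
entry start

segment readable executable

start:
        lea     rsi, [buffer]           ; rsi = scan pointer (stands in for REG_scan)
        lea     rdi, [buffer_end]       ; rdi = limit, held in a register (REG_limit)
        xor     eax, eax                ; eax = running sum (illustrative loop work)
        jmp     .try                    ; enter at the test, so an empty buffer runs zero times
        align   16                      ; padding is jumped over, never executed
.loop:
        movzx   edx, byte [rsi]         ; hypothetical loop body: add one byte to the sum
        add     eax, edx
        inc     rsi
.try:
        cmp     rsi, rdi                ; CF = 1 while rsi < rdi (unsigned)
        jc      .loop                   ; backward branch, taken on every iteration but the last
        mov     edi, eax                ; exit status = sum
        mov     eax, 60                 ; sys_exit
        syscall

segment readable

buffer      db 1, 2, 3, 4               ; hypothetical sample data
buffer_end:
```

Assembled with fasm and run, this should exit with status 10 (1+2+3+4). The loop has a single backward branch at the bottom, and the `cmp`/`jc` pair is adjacent so it can fuse on targets that support it.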
sylware 26 Feb 2026, 01:35
@bitRAKE, you prefer the loop control block after the loop body, with a conditional branch taken most of the time (to jump back to the loop body). But @AsmGuru62 pointed out that it is better to fall through into the loop body: the conditional branch would then be taken only once, to exit the loop, and the loop control block would sit before the loop body.
meh.
bitRAKE 26 Feb 2026, 02:56
We have greater nuance though. If we need to loop zero times then we need another test prior to the loop, or jump to the condition. One situation doesn't cover all occurrences.
_________________ ¯\(°_o)/¯ AI may [not] have aided with the above reply.
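The other option mentioned above, a separate test prior to the loop, could look roughly like the sketch below. The Linux x86-64 target, the count in `ecx`, the byte-summing body, and the `buffer`/`buffer_len` names are all assumptions for illustration, not anything taken from the thread.

```asm
format ELF64 executable 3               ; assumption: Linux x86-64 target
entry start

segment readable executable

start:
        lea     rsi, [buffer]           ; rsi = scan pointer
        mov     ecx, buffer_len         ; ecx = iteration count, may be zero
        xor     eax, eax                ; eax = running sum (illustrative loop work)
        test    ecx, ecx
        jz      .done                   ; guard: forward branch, taken only for empty input
        align   16                      ; fall-through entry executes the NOP padding once
.loop:
        movzx   edx, byte [rsi]         ; hypothetical loop body
        add     eax, edx
        inc     rsi
        dec     ecx
        jnz     .loop                   ; backward branch, taken while iterations remain
.done:
        mov     edi, eax                ; exit status = sum
        mov     eax, 60                 ; sys_exit
        syscall

segment readable

buffer      db 1, 2, 3, 4               ; hypothetical sample data
buffer_len = $ - buffer
```

When the program logic guarantees at least one iteration, the `test`/`jz` pair can simply be dropped and the rest stays the same.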
sylware 26 Feb 2026, 12:15
Yeah, and for many loops you can remove the 'empty' test, because you know from the semantics of the program that there is at least one iteration to go through.
All right, this is from Wikipedia, not x86_64 but RISC-V (sweet spot for modern CPU micro-architectures): "The RISC-V ISA recommends that software written for RISC-V harts (hardware threads), or generated to run on RISC-V harts, be optimized with the assumption that backward branches are taken and forward branches are not. (Even when the processor incorporates a more-advanced predictor that diminishes the relative value of static prediction.)"
This means, as a good general rule, that exiting the loop should be 'forward' and looping back to the loop body entry should be 'backward'. Namely, with a basic branch predictor, you can put the loop control block before or after the body; you can even split it into a loop control header and a loop control trailer.
EDIT: In the case of nested loops, to fit those 'static' branch prediction rules as well as possible, better start crafting from the deepest loop.
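As a sketch of that general rule in fasm syntax: the loop below has an internal exit that branches forward and a loop-back branch that goes backward. The Linux x86-64 target, the byte search, the needle value, and the `buffer` data are assumptions chosen for illustration, and the buffer is assumed non-empty so no guard test is needed.

```asm
format ELF64 executable 3               ; assumption: Linux x86-64 target
entry start

segment readable executable

start:
        lea     rsi, [buffer]           ; rsi = scan pointer
        lea     rdi, [buffer_end]       ; rdi = limit
        mov     al, 7                   ; hypothetical needle to search for
        align   16                      ; fall-through entry executes the NOP padding once
.loop:
        cmp     byte [rsi], al
        je      .found                  ; internal exit: forward, rarely taken
        inc     rsi
        cmp     rsi, rdi
        jb      .loop                   ; loop-back: backward, usually taken
.found:                                 ; also reached with rsi = buffer_end when nothing matches
        lea     rdx, [buffer]
        sub     rsi, rdx                ; rsi = index of the match (or the buffer length)
        mov     edi, esi                ; exit status = index
        mov     eax, 60                 ; sys_exit
        syscall

segment readable

buffer      db 3, 5, 7, 9               ; hypothetical sample data, assumed non-empty
buffer_end:
```

With this data the program should exit with status 2, the index of the byte 7. Both branches follow the "backward taken, forward not taken" assumption, so even a purely static predictor would only miss on the final exit.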
Copyright © 1999-2026, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.