flat assembler
Message board for the users of flat assembler.

packed and partially used cache line register loads

sylware



Joined: 23 Oct 2020
Posts: 513
Location: Marseille/France
sylware 13 Jan 2026, 14:44
On modern hardware, is it worthwhile to pack together the register loads from a cache line, knowing that some of the loaded values may not be used by the following code?

Or is it better to skip some of those cache line register loads with branches (and then take the hit on the branch predictor...)?

Because I would not be surprised if modern microarchitectures had a fused "cache line to many registers" operation internally.
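
To make the comparison concrete, here is a minimal sketch of the two variants in fasm syntax. It assumes rsi points at a 64-byte-aligned block; the offsets, register choices and the "do we need the rest?" flag in ecx are purely illustrative, not taken from any real code.

Code:
; Variant A: pack the loads; r10/r11 are fetched even if unused later.
        mov     rax,[rsi]          ; field 0
        mov     rdx,[rsi+8]        ; field 1
        mov     r10,[rsi+16]       ; field 2, possibly unused
        mov     r11,[rsi+24]       ; field 3, possibly unused

; Variant B: branch over the possibly-unused loads.
        mov     rax,[rsi]
        mov     rdx,[rsi+8]
        test    ecx,ecx            ; hypothetical "need the rest?" flag
        jz      skip_rest
        mov     r10,[rsi+16]
        mov     r11,[rsi+24]
skip_rest:

Either way the first load already brings the whole line into L1, so the trade-off is only the extra instructions in A versus the branch (and its predictability) in B.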
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20825
Location: In your JS exploiting you and your system
revolution 13 Jan 2026, 15:03
In general, for most modern CPUs, transfers between memory and a cache level, or between cache levels (mem -> L3, L2 -> L1, etc.), are always done in whole cache-line blocks. Never smaller.

So using redundant instructions to load more data than needed into registers is most likely only wasting I-Cache space and wasting energy unnecessarily.

However the real way to find out is to code up each method (with vs. without) and see which is better for the app being measured. No amount of guessing based upon heuristics, or reading tea leaves, can give a definitive answer. Sometimes the results can be surprising.
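
For what it's worth, a bare-bones way to run such a with/without comparison on x86-64 is to time each variant with rdtscp and keep the minimum over many runs. A hedged sketch in fasm syntax (here 'variant' is a hypothetical routine containing the sequence under test):

Code:
; Hypothetical timing skeleton: measure 'variant' in reference cycles.
; Keeping the minimum over many iterations damps interrupts and turbo jitter.
        mov     ebp,100000         ; iteration count, arbitrary
        mov     r15,-1             ; best (minimum) observed time
measure_loop:
        lfence                     ; order the timestamp against earlier work
        rdtscp                     ; edx:eax = TSC (also clobbers ecx)
        shl     rdx,32
        or      rax,rdx
        mov     r14,rax            ; start timestamp

        call    variant            ; the code under test (with or without)

        rdtscp
        lfence
        shl     rdx,32
        or      rax,rdx
        sub     rax,r14            ; elapsed reference cycles
        cmp     rax,r15
        cmovb   r15,rax            ; keep the minimum
        dec     ebp
        jnz     measure_loop
; r15 now holds the best time observed for 'variant'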
sylware



Joined: 23 Oct 2020
Posts: 513
Location: Marseille/France
sylware 14 Jan 2026, 12:42
I know that the only real way is to write a close-to-real-life benchmark... but you need the CPUs, and even so, there are some CPU implementation optimizations which are rather generic and common, or at least a 'way' to code which will be friendly to those optimizations.


revolution wrote:
So using redundant instructions to load more data than needed into registers is most likely only wasting I-Cache space and wasting energy unnecessarily.


This is another topic, but I am really not convinced that, on modern and large CPU implementations, the I-cache impact is significant; for instance, I have doubts about whether the silicon cost of 'compressed' machine instructions on RISC-V, or 'Thumb' machine instructions on ARM (erk!), is worth it.


There, we would need benchmarks (speed and energy consumption) on ARM (erk!) or RISC-V desktop/mobile/server implementations. Hopefully not in too weird niche use cases.


I have not started to implement 'compressed instructions' in my rv64onx64 interpreter for that reason (and they defeat the purpose of the 'R' in RISC).

I also recall, from some x64 optimization guide, something about packing the loads/stores close to each other within a code fetch window, up to a few of them (I don't recall the number range). See the sketch below.
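
I don't have that guide at hand, so take this only as a hedged illustration of what "packing the loads close together" could look like, versus interleaving them with ALU work; whether it actually matters depends on the fetch/decode width and the out-of-order scheduler of the specific core.

Code:
; Interleaved: loads scattered between dependent ALU ops.
        mov     rax,[rsi]
        add     rax,rbx
        mov     rdx,[rsi+8]
        imul    rdx,rcx
        mov     r8,[rsi+16]
        xor     r8,rbx

; Grouped: the same loads issued back to back, ALU work afterwards.
        mov     rax,[rsi]
        mov     rdx,[rsi+8]
        mov     r8,[rsi+16]
        add     rax,rbx
        imul    rdx,rcx
        xor     r8,rbx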
