flat assembler
Message board for the users of flat assembler.

packed and partially used cache line register loads

sylware



Joined: 23 Oct 2020
Posts: 513
Location: Marseille/France
sylware 13 Jan 2026, 14:44
On modern hardware, is it worthwhile to pack together the register loads from a cache line, knowing that some of the loaded values may not be used by the following code?

Or is it better to skip some of those cache line register loads with branches (and then take the hit on the branch predictor...)?

Because I would not be surprised if modern microarchitectures had a fused "cache line to many registers" operation internally.
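
To make the comparison concrete, here is a minimal sketch of the two variants in fasm syntax. It assumes rsi points at a 64-byte-aligned block; the offsets, register choices and the "do we need the rest?" flag in ecx are purely illustrative, not taken from any real code.

Code:
; Variant A: pack the loads; r10/r11 are fetched even if unused later.
        mov     rax,[rsi]          ; field 0
        mov     rdx,[rsi+8]        ; field 1
        mov     r10,[rsi+16]       ; field 2, possibly unused
        mov     r11,[rsi+24]       ; field 3, possibly unused

; Variant B: branch over the possibly-unused loads.
        mov     rax,[rsi]
        mov     rdx,[rsi+8]
        test    ecx,ecx            ; hypothetical "need the rest?" flag
        jz      skip_rest
        mov     r10,[rsi+16]
        mov     r11,[rsi+24]
skip_rest:

Either way the first load already brings the whole line into L1, so the trade-off is only the extra instructions in A versus the branch (and its predictability) in B.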
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20825
Location: In your JS exploiting you and your system
revolution 13 Jan 2026, 15:03
In general, for most modern CPUs, transfers between memory and a cache level, or between cache levels (mem -> L3, L2 -> L1, etc.), are always done in whole cache-line blocks. Never smaller.

So using redundant instructions to load more data than needed into registers is most likely only wasting I-Cache space and wasting energy unnecessarily.

However the real way to find out is to code up each method (with vs. without) and see which is better for the app being measured. No amount of guessing based upon heuristics, or reading tea leaves, can give a definitive answer. Sometimes the results can be surprising.
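
For what it's worth, a bare-bones way to run such a with/without comparison on x86-64 is to time each variant with rdtscp and keep the minimum over many runs. A hedged sketch in fasm syntax (here 'variant' is a hypothetical routine containing the sequence under test):

Code:
; Hypothetical timing skeleton: measure 'variant' in reference cycles.
; Keeping the minimum over many iterations damps interrupts and turbo jitter.
        mov     ebp,100000         ; iteration count, arbitrary
        mov     r15,-1             ; best (minimum) observed time
measure_loop:
        lfence                     ; order the timestamp against earlier work
        rdtscp                     ; edx:eax = TSC (also clobbers ecx)
        shl     rdx,32
        or      rax,rdx
        mov     r14,rax            ; start timestamp

        call    variant            ; the code under test (with or without)

        rdtscp
        lfence
        shl     rdx,32
        or      rax,rdx
        sub     rax,r14            ; elapsed reference cycles
        cmp     rax,r15
        cmovb   r15,rax            ; keep the minimum
        dec     ebp
        jnz     measure_loop
; r15 now holds the best time observed for 'variant'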
sylware



Joined: 23 Oct 2020
Posts: 513
Location: Marseille/France
sylware 14 Jan 2026, 12:42
I know that the only real way is to write a close-to-real-life benchmark... but you need the CPUs, and even so, there are some CPU implementation optimizations which are rather generic and common, or at least a 'way' to code which will be friendly to those optimizations.


revolution wrote:
So using redundant instructions to load more data than needed into registers is most likely only wasting I-Cache space and wasting energy unnecessarily.


This is another topic, but I am really not convinced that, on modern and large CPU implementations, the I-cache impact is significant; for instance, I have doubts about whether the silicon cost of 'compressed' machine instructions on RISC-V, or 'Thumb' machine instructions on ARM (erk!), is worth it.


There, we would need benchmarks (speed and energy consumption) on ARM (erk!) or RISC-V desktop/mobile/server implementations. Hopefully not in too weird niche use cases.


I have not started to implement 'compressed instructions' in my rv64onx64 interpreter for that reason (and they defeat the purpose of the 'R' in RISC).

I also recall, from some x64 optimization guide, something about packing the loads/stores close to each other within a code fetch window, up to a few of them (I don't recall the number range). See the sketch below.
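
I don't have that guide at hand, so take this only as a hedged illustration of what "packing the loads close together" could look like, versus interleaving them with ALU work; whether it actually matters depends on the fetch/decode width and the out-of-order scheduler of the specific core.

Code:
; Interleaved: loads scattered between dependent ALU ops.
        mov     rax,[rsi]
        add     rax,rbx
        mov     rdx,[rsi+8]
        imul    rdx,rcx
        mov     r8,[rsi+16]
        xor     r8,rbx

; Grouped: the same loads issued back to back, ALU work afterwards.
        mov     rax,[rsi]
        mov     rdx,[rsi+8]
        mov     r8,[rsi+16]
        add     rax,rbx
        imul    rdx,rcx
        xor     r8,rbx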
