instruction memory bandwidth

sylware

Joined: 23 Oct 2020
Posts: 589
Location: Marseille/France

sylware 10 Apr 2026, 08:18

Are there any benchmarks of realistic use cases out there which show that instruction memory bandwidth can have a significant impact on performance?

(On common modern CPUs, of course.)

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20990
Location: In your JS exploiting you and your system

revolution 10 Apr 2026, 08:40

What does "instruction memory bandwidth" mean? Is it the overhead of reading instruction bytes into the CPU from memory?

Unrolled loops, and very long stretches of code without branches, can render the caches useless. Caches only help when the same bytes are read more than once, so with code that runs once and is never revisited the CPU ends up fetching every instruction from DRAM. I imagine that would create huge slowdowns (a.k.a. cache thrashing).
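A back-of-the-envelope sketch of that worst case, where no instruction byte is ever reused. All figures in it (4 instructions per cycle, 4 bytes per instruction, 4 GHz, 50 GB/s of DRAM bandwidth) are illustrative assumptions, not measurements of any particular CPU.

```python
def fetch_bandwidth_gb_s(ipc, bytes_per_insn, ghz):
    """GB/s of instruction bytes needed to sustain `ipc` with zero cache reuse."""
    # (insn/cycle) * (bytes/insn) * (1e9 cycles/s) / (1e9 bytes per GB)
    return ipc * bytes_per_insn * ghz


def sustainable_ipc(dram_gb_s, bytes_per_insn, ghz):
    """Upper bound on IPC if every instruction byte has to stream from DRAM."""
    return dram_gb_s / (bytes_per_insn * ghz)


if __name__ == "__main__":
    # Hypothetical core: 4 insn/cycle, 4-byte instructions, 4 GHz.
    print(fetch_bandwidth_gb_s(4, 4, 4.0), "GB/s needed")
    # Hypothetical memory system: 50 GB/s, with 4-byte vs 2-byte instructions.
    print(sustainable_ipc(50, 4, 4.0), "IPC ceiling with 4-byte instructions")
    print(sustainable_ipc(50, 2, 4.0), "IPC ceiling with 2-byte instructions")
```

Under these assumptions the instruction stream alone would want more bandwidth than the memory system offers, but only for code with no reuse at all, which is exactly the case that is hard to find in real workloads.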

sylware 11 Apr 2026, 08:32

I meant real-life benchmarks which show a significant speed impact of code density for modern ISAs.

Is that even a thing?

revolution 11 Apr 2026, 08:42

sylware wrote:

Is that even a thing?

It probably is, but the term "code density" needs to be explained more. Does it mean always choosing the shortest set of instructions to complete a task (i.e. code golf)?

bitRAKE

Joined: 21 Jul 2003
Posts: 4487
Location: vpcmpistri

bitRAKE 11 Apr 2026, 10:36

I'm confused as well. "Code density" sounds like you want to benchmark the decoder. Compiled code isn't very dense; size-optimized code can be very dense.

Usually what you want to do is pick some part of the pipeline you want to test and then design or search for a test that does that -- it probably exists.

sylware 11 Apr 2026, 17:34

It translates to this question: on modern CPUs, can memory bandwidth be an issue for feeding instructions to the CPU, in real-life use cases?

revolution 11 Apr 2026, 21:37

Isn't the first reply the answer? If not then there needs to be more clarity on exactly what is meant to be measured and what criteria are used to decide what is an issue.

sylware 12 Apr 2026, 01:50

Basically: the CPU stalls because it cannot load machine instructions from memory fast enough, in real-life use cases.

(RISC-V and ARM have "compressed" and "Thumb" machine instructions, but I am really not convinced they make a performance difference on non-niche hardware.)

bitRAKE 12 Apr 2026, 02:01

sylware wrote:

On modern CPUs, can memory bandwidth be an issue for feeding instructions to the CPU, in real-life use cases?

Yes. Initialization code executes only once and pollutes the cache. The processor is _very_ good at predicting the next cache line needed for the instruction cache though. One could make that prediction more difficult and the initialization code much larger. Did you know that the Collatz (3x+1) trajectories are unpredictable but always reach one (exit condition)?

Code:

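use64
align 64
        shr eax, 1              ; eax = x shr 1, CF = low bit of x
        jz @F                   ; x was 0 or 1: trajectory finished
        jnc .even               ; x was even: next value is x/2
        lea eax, [3*rax+2]      ; x was odd: 3*((x-1)/2)+2 = (3x+1)/2
.even:  imul edx, eax, 64       ; byte offset of cache line number eax
        add rdx, rbx ; memory base
        jmp rdx                 ; continue in that cache line
@@:     retn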

... fill a large piece of memory with these cache lines (the processor does not know they are the same). Put numbers into EAX and call to a cache line. I'm sure there are many techniques to do the same thing. Obviously, you'll need to fill enough cache lines to cover the Collatz trajectories you're using. Also you'll need the trajectory length to exceed your cache size divided by 64.

Edit: actually, the trajectories seem too sparse for a given bit length - a more complex attenuation of the signal would be needed, or multiple trajectories is easier.
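To get a feel for how sparse the trajectories are, here is a small Python model of the routine above. It is only a sketch of my reading of the code: it ignores 32-bit wraparound, and the helper names (`step`, `lines_visited`, `footprint`) are made up for this illustration.

```python
LINE_BYTES = 64  # one copy of the routine per 64-byte line (align 64)


def step(x):
    """One pass through the routine: x/2 if x is even, (3x+1)/2 if x is odd."""
    half = x >> 1                       # shr eax, 1
    if x % 2 == 0:                      # jnc .even
        return half
    return 3 * half + 2                 # lea eax, [3*rax+2]


def lines_visited(x):
    """Line indices jumped to, in order, when the routine is entered with EAX = x."""
    visited = []
    while x >> 1:                       # jz @F: stop once the shift yields zero
        x = step(x)
        visited.append(x)               # jmp rbx + 64*x
    return visited


def footprint(starts):
    """Return (distinct lines touched, bytes of buffer needed) for the given starts."""
    touched = set()
    for s in starts:
        touched.update(lines_visited(s))
    if not touched:
        return 0, 0
    return len(touched), (max(touched) + 1) * LINE_BYTES


if __name__ == "__main__":
    print(lines_visited(7))             # one trajectory
    print(footprint(range(2, 1000)))    # many trajectories sharing one buffer
```

The number to watch is the first value returned by `footprint`: it has to exceed the instruction cache size divided by 64 (512 lines for an assumed 32 KiB cache) before the lines stop fitting, which is why a single trajectory is unlikely to be enough.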

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.

Last edited by bitRAKE on 12 Apr 2026, 02:26; edited 4 times in total

revolution 12 Apr 2026, 02:08

ARM Thumb (16-bit) instructions are generally slower than the full-length 32-bit instructions for a given task. Thumb is more constrained and requires more instructions to complete a task. Thumb wasn't designed for performance though; it was intended for memory-constrained systems. So any performance comparison is flawed, because a performance advantage was never the intent.

x86 has single-byte instructions, but a single byte can only encode a small amount of detail, and thus those instructions are limited in scope. Adding more bytes gives more expressiveness, which reduces instruction counts but increases byte counts. There is a trade-off. Where should that trade-off be made? It depends upon the task.

revolution 12 Apr 2026, 02:12

bitRAKE wrote:

... the Collatz (3x+1) ... always reach one ...

That is a conjecture, not a proven result. If anyone finds a counterexample they will be famous.

bitRAKE 12 Apr 2026, 02:25

pcbarina.fit.vutbr.cz wrote:

2025-01-15 the convergence of all numbers below 2^71 is verified

... we're probably safe for the range needed in this benchmark.
(Until the processor manufacturer starts copying benchmark methodologies.)

revolution 12 Apr 2026, 05:29

I am reminded of the Itanium EPIC instruction encoding: 128 bits per "bundle".

It was very powerful and could potentially be quite compact. But it turned out to be too complex, and too hard for compilers to generate good code for.

Maybe 32 bits per instruction is the sweet spot? Many RISC encodings use 32 bits.
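For comparison, a quick bit count for the Itanium bundle. The layout (three 41-bit slots plus a template field in 128 bits) is the documented IA-64 format; the NOP counts passed in below are hypothetical inputs, not measured compiler output.

```python
BUNDLE_BITS = 128
SLOTS = 3
SLOT_BITS = 41
TEMPLATE_BITS = BUNDLE_BITS - SLOTS * SLOT_BITS  # bits left over for the template


def bits_per_useful_insn(nops_per_bundle):
    """Bundle bits spent per non-NOP instruction, given the NOP slots per bundle."""
    useful = SLOTS - nops_per_bundle
    if useful <= 0:
        raise ValueError("a bundle needs at least one useful instruction")
    return BUNDLE_BITS / useful


if __name__ == "__main__":
    print("template bits:", TEMPLATE_BITS)
    for nops in range(SLOTS):
        print(nops, "NOP slot(s):", round(bits_per_useful_insn(nops), 1), "bits/insn")
```

Even a perfectly filled bundle costs a bit more than 32 bits per instruction, so whether the encoding ends up compact depends on how much work each slot does and how many slots the compiler has to pad.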

For non-performance applications the field is much more diverse. Especially the older Z80, 6502, etc. and the current PIC (8, 10, 12, 14, 16 bit) and AVR (16 bit).

sylware 12 Apr 2026, 10:25

So this is what I expected: those benchmarks do not exist, and memory bandwidth is hardly a concern when it comes to feeding machine instructions to the CPU.

While I am thinking of it, what about the memory alignment of the machine instruction fetch window? If I am not mistaken, modern CPUs use this instruction fetch window for their own optimizations (branch prediction, etc.). Maybe it is worth a forum thread of its own, as it is what matters in the end?
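To make the alignment question concrete, here is a small sketch that counts how many aligned fetch windows a block of code overlaps. The window sizes (32 and 64 bytes), the addresses and the 19-byte loop body are all assumptions for illustration; the real fetch width depends on the microarchitecture.

```python
def windows_touched(start, size, window=32):
    """Number of aligned `window`-byte blocks overlapped by [start, start + size)."""
    if size <= 0:
        return 0
    first = start // window
    last = (start + size - 1) // window
    return last - first + 1


def pad_to_align(start, window=32):
    """Padding bytes needed so that `start` lands on a window boundary."""
    return (-start) % window


if __name__ == "__main__":
    # Hypothetical 19-byte loop body at two different addresses, 64-byte windows.
    print(windows_touched(0x1000, 19, 64))   # starts on a boundary
    print(windows_touched(0x103A, 19, 64))   # starts 58 bytes into a window
    print(pad_to_align(0x103A, 64))          # padding that would realign it
```

The same loop can cost one fetch or two per iteration depending only on where it starts, which is the kind of effect an `align` directive before a hot loop is meant to avoid.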

revolution 12 Apr 2026, 10:58

Benchmarks are useless anyway, so I don't think it matters much.
