flat assembler
Message board for the users of flat assembler.

instruction memory bandwidth

sylware



Joined: 23 Oct 2020
Posts: 562
Location: Marseille/France
sylware 10 Apr 2026, 08:18
Are there any benchmarks of realistic use cases out there which show that instruction memory bandwidth can have a significant impact on performance?

(on common modern CPUs, of course)
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20938
Location: In your JS exploiting you and your system
revolution 10 Apr 2026, 08:40
What does "instruction memory bandwidth" mean? Is it the overhead of reading instruction bytes into the CPU from memory?

Unrolling loops, and very long sections of code without branches, can render the caches useless. Caches only help when the same bytes are read more than once, so code that runs straight through without repeating is always being read from DRAM. I imagine that will create huge slowdowns, a.k.a. cache thrashing.



sylware 11 Apr 2026, 08:32
I meant real-life benchmarks which show a significant speed impact of code density for modern ISAs.

Is that even a thing?


revolution 11 Apr 2026, 08:42
sylware wrote:
Is that even a thing?
It probably is, but the term "code density" needs to be explained more. Does it mean always choosing the shortest set of instructions to complete a task (i.e. code golf)?
bitRAKE



Joined: 21 Jul 2003
Posts: 4428
Location: vpcmpistri
bitRAKE 11 Apr 2026, 10:36
I'm confused as well. "code density" seems like you want to benchmark the decoder. Compiled code isn't very dense. Size optimized code can be very dense.

Usually what you want to do is pick some part of the pipeline you want to test and then design or search for a test that does that -- it probably exists.



sylware 11 Apr 2026, 17:34
It translates to this question: on modern CPUs, can memory bandwidth be an issue for feeding instructions to the CPU? That is, in real-life use cases.


revolution 11 Apr 2026, 21:37
Isn't the first reply the answer? If not then there needs to be more clarity on exactly what is meant to be measured and what criteria are used to decide what is an issue.



sylware 12 Apr 2026, 01:50
Basically: the CPU stalling because it cannot load machine instructions from memory fast enough, and that in real-life use cases.

(RISC-V and ARM have "compressed" and "thumb" machine instructions, but I am really not convinced it makes a performance difference on non-niche hardware)



bitRAKE 12 Apr 2026, 02:01
sylware wrote:
On modern CPUs, can memory bandwidth be an issue for feeding instructions to the CPU? That is, in real-life use cases.
Yes. Initialization code executes only once and pollutes the cache. The processor is _very_ good at predicting the next cache line needed for the instruction cache though. One could make that prediction more difficult and the initialization code much larger. Did you know that the Collatz (3x+1) trajectories are unpredictable but always reach one (exit condition)?
Code:
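use64
align 64
        shr eax, 1              ; eax = n shr 1, CF = low bit of n (set if n was odd)
        jz @F                   ; result is zero when n was 0 or 1: trajectory done
        jnc .even               ; n was even: eax already holds n/2
        lea eax, [3*rax+2]      ; n was odd: 3*((n-1)/2)+2 = (3n+1)/2
.even:  imul edx, eax, 64       ; byte offset of the cache line for the next value
        add rdx, rbx ; memory base
        jmp rdx                 ; continue in that cache line
@@: retn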
... fill a large piece of memory with these cache lines (the processor does not know they are the same). Put numbers into EAX and call to a cache line. I'm sure there are many techniques to do the same thing. Obviously, you'll need to fill enough cache lines to cover the Collatz trajectories you're using. Also you'll need the trajectory length to exceed your cache size divided by 64.

Edit: actually, the trajectories seem too sparse for a given bit length - a more complex attenuation of the signal would be needed, or multiple trajectories is easier.
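
For anyone who wants to check how sparse the trajectories are before building the real thing, here is a minimal Python sketch (not part of the original post) that models the cache-line stub above: same 32-bit arithmetic, same (3n+1)/2 shortcut, and it records the byte offset each `jmp rdx` would land on. The names `stub_step`, `jump_offsets` and `lines_touched`, and the 32 KiB L1i size, are assumptions for illustration only.

```python
LINE = 64                 # bytes per cache line, as in the stub (align 64)
MASK32 = 0xFFFFFFFF       # eax/edx are 32-bit registers


def stub_step(eax):
    """Model one pass through the stub. Returns the next eax, or None on retn."""
    carry = eax & 1       # shr eax, 1 moves the low bit into CF
    eax >>= 1
    if eax == 0:          # jz @F: the input was 0 or 1
        return None
    if carry:             # odd input: lea eax, [3*rax+2] gives (3n+1)/2
        eax = (3 * eax + 2) & MASK32
    return eax


def jump_offsets(n):
    """Byte offsets (relative to rbx) that jmp rdx targets, starting from eax = n."""
    offsets = []
    eax = stub_step(n & MASK32)
    while eax is not None:
        if eax * LINE > MASK32:
            # imul edx, eax, 64 would wrap around here
            raise OverflowError(f"offset for value {eax} does not fit in edx")
        offsets.append(eax * LINE)
        eax = stub_step(eax)
    return offsets


def lines_touched(starts):
    """Distinct cache lines reached over several trajectories, and bytes to fill."""
    seen = set()
    for n in starts:
        seen.update(jump_offsets(n))
    needed = max(seen) + LINE if seen else 0
    return len(seen), needed


if __name__ == "__main__":
    L1I_BYTES = 32 * 1024     # assumed L1 instruction cache size, adjust for your CPU
    count, needed = lines_touched(range(2, 1000))
    print("distinct lines:", count, "of", needed // LINE, "filled")
    print("exceeds assumed L1i:", count > L1I_BYTES // LINE)
```

With a single starting value the distinct-line count equals the number of jumps in the trajectory, so comparing `count` against the cache size divided by 64 shows whether one trajectory is long enough or whether several are needed. The `OverflowError` marks the point where the 32-bit `imul` in the stub would wrap, which caps the usable values below 2^26.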

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.


Last edited by bitRAKE on 12 Apr 2026, 02:26; edited 4 times in total


revolution 12 Apr 2026, 02:08
ARM Thumb (16-bit) instructions are generally slower than the full-length 32-bit instructions for a given task. Thumb is more constrained and requires more instructions to complete a task. Thumb wasn't designed for performance though; it was intended for memory-constrained systems. So any performance comparison is flawed, because it was never intended to give a performance advantage.

x86 has single byte instructions, but a single byte can only encode a small amount of detail, and thus those instructions are limited in scope. Adding more bytes gives more expressibility, reducing instruction counts, but increases byte counts. There is a trade-off. Where should that trade-off be made? Depends upon the task.


revolution 12 Apr 2026, 02:12
bitRAKE wrote:
... the Collatz (3x+1) ... always reach one ...
That is a conjecture, not a proven result. If anyone finds a counterexample they will be famous.



bitRAKE 12 Apr 2026, 02:25
pcbarina.fit.vutbr.cz wrote:
2025-01-15 the convergence of all numbers below 2^71 is verified
... we're probably safe for the range needed in this benchmark.
(Until the processor manufacturer starts copying benchmark methodologies.)


revolution 12 Apr 2026, 05:29
I am reminded of the Itanium EPIC instruction encoding: 128 bits per "bundle".

It was very powerful and could potentially be quite compact. But it turned out to be too complex and too hard for the compilers to make good code.

Maybe 32 bits/instruction is the sweet spot? Many RISC encodings use 32 bits.

For non-performance applications the field is much more diverse. Especially the older Z80, 6502, etc. and the current PIC (8, 10, 12, 14, 16 bit) and AVR (16 bit).



sylware 12 Apr 2026, 10:25
So this is what I expected: those benchmarks do not exist, and memory bandwidth is hardly a thing when dealing with CPU machine instructions.

While I am thinking of it, what about the memory alignment of the machine instruction fetch window? If I am not mistaken, modern CPUs use this instruction fetch window for their own optimizations (branch prediction, etc.). Maybe it deserves a forum thread of its own, since it is what matters in the end?
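
To make the alignment question concrete, a rough Python sketch (an illustration, not a measurement): it counts how many aligned fetch blocks a run of code bytes spans for a given start address. The function name `fetch_blocks` and the 16-, 32- and 64-byte window sizes are assumptions; the real fetch width depends on the microarchitecture.

```python
def fetch_blocks(start, length, window=16):
    """Number of aligned `window`-byte blocks covered by [start, start + length)."""
    if length <= 0:
        return 0
    first = start // window                 # block holding the first byte
    last = (start + length - 1) // window   # block holding the last byte
    return last - first + 1


if __name__ == "__main__":
    # A hypothetical 20-byte loop body placed at different start offsets.
    for window in (16, 32, 64):             # assumed fetch window sizes
        for start in (0, 8, 60):
            print(window, start, fetch_blocks(start, 20, window))
```

The same 20 bytes can cost one block or two depending only on where they start, which is the kind of effect an `align` directive in front of a hot loop is meant to control.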


revolution 12 Apr 2026, 10:58
Benchmarks are useless anyway, so I don't think it matters much.


Copyright © 1999-2026, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.