flat assembler
Message board for the users of flat assembler.

Index > Main > L1 cache latency vs recompute

sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 10 May 2023, 12:42
On modern architectures (for instance Zen 1-4), roughly speaking, how many basic arithmetic operations can I fit in the time of one L1 cache load?

Because instead of two consecutive L1 cache loads (from the same cache line), it may be more interesting to do one L1 cache load and recompute one of the loaded values from the other, if that recomputation is really simple.
bitRAKE



Joined: 21 Jul 2003
Posts: 4050
Location: vpcmpistri
bitRAKE 10 May 2023, 14:21
It really depends on where the data is coming from and whether the load is predicted. A full unpredicted load from memory could shadow hundreds of operations.* Definitely something to test.

* This is assuming an ideal case where the memory request can be served immediately with no contention or other delays. In practice, contention, queuing delays, page table walks, and other factors can significantly increase the effective latency of a memory access.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 10 May 2023, 19:20
For instance, with value0 and value1 in the same L1 cache line, where value1 = (value0 + 0xff) & 0xffffffffffffff00:

mov rax,qword ptr [rel val0]
mov rcx,qword ptr [rel val1]

vs

mov rax,qword ptr [rel val0]
mov rcx,rax
add rcx,0xff
and rcx,~0xff

I am thinking about the L1 cache latencies in Agner Fog's docs, namely the cost of a transfer from L1 to a register. The transfer time could actually be significant compared to a few simple register-only instructions. Do we have some metrics on this transfer cost?

Additionally, if this is done everywhere, more data could fit in the cache, which in the end should favor overall speed.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20363
Location: In your JS exploiting you and your system
revolution 10 May 2023, 19:42
Forget about using static timing values. They are useless without knowing the complete internal state of all instructions before and after the load. The load latency might, or might not, be completely irrelevant. It could be the case that a 20-cycle latency makes no difference for your code because some other instruction is causing delays.

Modern CPUs can have more than 100 instructions in flight, executing in opportunistic order as their dependencies become available.

When you have your full application working, then you can try both of your proposed code sequences and see which performs better, on real world data in the working app. No amount of synthetic testing or reading latency values can substitute for this.

But that said, there are some base level truths to be aware of. Caches are small, and it can be of great value to minimise the redundant data stored in them so as to maximise their effectiveness. You can't know if it will help in your particular application until you try each path and see the results.


bitRAKE



Joined: 21 Jul 2003
Posts: 4050
Location: vpcmpistri
bitRAKE 10 May 2023, 20:02
The code is larger and it needs to come from somewhere and be stored; there is a dependency chain which may or may not be mitigated by surrounding code, etc.

L1 has a 4-cycle latency, but the two loads can be done in parallel.

The dependent chain is obviously longer out of context.
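
Roughly, assuming a ~4-cycle L1 load-to-use latency and single-cycle LEA/ADD/AND (typical Zen-class figures, so this is only a sketch, not a guarantee):
Code:
; two independent loads: both can issue in the same cycle,
; each result is ready after the L1 load-to-use latency (~4 cycles)
mov rax, [rel val0]        ; rax ready ~cycle 4
mov rcx, [rel val1]        ; rcx ready ~cycle 4 (second load port)

; load + recompute: the ALU ops depend on the load,
; so the chain is ~4 (load) + 1 (lea) + 1 (and) = ~6 cycles until rcx is ready
mov rax, [rel val0]        ; rax ready ~cycle 4
lea rcx, [rax + 0xFF]      ; rcx ready ~cycle 5
and rcx, -0x100            ; rcx ready ~cycle 6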
Furs



Joined: 04 Mar 2016
Posts: 2523
Furs 11 May 2023, 13:30
sylware wrote:
I am thinking about the L1 cache latencies in Agner Fog's docs, namely the cost of a transfer from L1 to a register. The transfer time could actually be significant compared to a few simple register-only instructions. Do we have some metrics on this transfer cost?
It's 4 cycles afaik.

Of course, it's not that simple: while the other operations may each take 1 clock cycle and can run in parallel, they use different units (ALUs, etc.) and ports than a memory read. So if those are full and bottlenecked, the arithmetic instructions will stall more.

As a general rule of thumb, though, when writing it initially without testing, and because different CPUs behave differently and you may want only one "generic" binary that works reasonably well: favor arithmetic instructions if their latencies are small enough.

That's because even if it's slower in that particular calculation, it still frees up L1 cache for other things that may be faster later down the line.

And try to use the smallest data types that still fit all the values you need. Still do the arithmetic on 32-bit registers, but store the values in smaller types (read them with movzx or movsx into a 32-bit register when needed).
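
For instance (a minimal sketch; counts is just a hypothetical byte-sized table):
Code:
counts rb 256                      ; byte-sized counters instead of one qword each

; widen on load, do the arithmetic on a 32-bit register,
; store back in the narrow type
movzx eax, byte [counts + rbx]     ; zero-extend the byte into eax
add eax, 1
mov [counts + rbx], al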
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20363
Location: In your JS exploiting you and your system
revolution 11 May 2023, 19:12
Furs wrote:
... Favor arithmetic instructions if their latencies are small enough.

That's because even if it's slower in that particular calculation, it still frees up L1 cache for other things that may be faster later down the line.
You have competing priorities. You can free up L1D but use more L1I.

But why only consider L1? There are many other units in a CPU; L1 is just one of them. If you were to try to write it all out and predict where, when, and how the instruction flows through the CPU, you would fail, because many other things are happening that are unrelated to that one instruction, and all of them compete for the same resources.

Actually, if one has to ask whether a 3-cycle L1 latency is "faster" or "slower" for this one instruction, it signals that the app is not yet ready for the optimisation stage, because it is easy to simply write it both ways and test it. So asking whether A will be faster than B is the wrong question. It is very unlikely it can be predicted by simply reading Agner Fog's tables. They tell you nothing about how the instruction executes within the environment of 100 other instructions and all the data flows and cache states, and many other things, that are unknowable until it is run on real data.

Don't guess. Measure.
bitRAKE



Joined: 21 Jul 2003
Posts: 4050
Location: vpcmpistri
bitRAKE 12 May 2023, 05:23
sylware wrote:
For instance, with value0 and value1 in the same L1 cache line, where value1 = (value0 + 0xff) & 0xffffffffffffff00:

mov rax,qword ptr [rel val0]
mov rcx,qword ptr [rel val1]

vs

mov rax,qword ptr [rel val0]
mov rcx,rax
add rcx,0xff
and rcx,~0xff
Assuming this is all we have to go on, and our #1 imperative is: we must reduce L1 data cache usage. We can even manufacture a little story: maybe a complex function table has been pre-computed, and other tables have put too much pressure on the L1 data cache, making it difficult to stream in the data being operated on. This table is the least complex amongst the group - low-hanging fruit. So there is a lot of processing taking place, and we should be able to shadow the greater complexity within code that is clearly not saturating the execution units.

So, we try some different alternatives ...
Code:
; start the constant early; the ADD after the load is then reg-reg
mov ecx, 0xFF
mov rax, [rel val0]
add rcx, rax
and rcx, -0x100

; fold the +0xFF into the address calculation
mov rax, [rel val0]
lea rcx, [rax + 0xFF]
and rcx, -0x100

; R15 = 0xFF, code size almost the same in the loop
mov rax, [rel val0]
lea rcx, [rax + r15]
andn rcx, r15, rcx    ; BMI1: rcx = (not r15) and rcx
... it's a fun game to brainstorm on different alternatives.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 12 May 2023, 10:54
The point is not to be perfect but to have a general idea for the first implementation; then, if we want to spend more time on it, to do some measurements...

Measurements: I would have to do coarse measurements in order to measure the impact of the fine measurements themselves! And that with a few proper "real" system loads.

Basically, per type of system load: first the coarse rdtsc probes alone, then coarse + fine rdtsc probes, to see how much the fine probes impact the targeted code path.
Then I can work on the fine rdtsc data.

Yeah, I only have a Zen 2.

Hopefully I'll find some general ideas on which first implementation I should favor (or spot obvious Zen 2 performance issues :)).
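
A minimal sketch of such a fine rdtsc probe, assuming rdtscp is available (the probe's own overhead and serialization are exactly what the coarse vs. coarse+fine comparison has to expose):
Code:
        lfence                     ; don't let earlier work leak into the timed region
        rdtsc                      ; start: EDX:EAX = time-stamp counter
        shl rdx, 32
        or rax, rdx
        mov r8, rax                ; r8 = start TSC (register chosen arbitrarily)

        ; ... targeted code path ...

        rdtscp                     ; stop: waits for preceding instructions to finish (clobbers ecx)
        lfence                     ; don't let later work start inside the timed region
        shl rdx, 32
        or rax, rdx
        sub rax, r8                ; rax = elapsed TSC ticks (reference clock, not core clocks)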
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20363
Location: In your JS exploiting you and your system
revolution 12 May 2023, 11:08
Using RDTSC is fraught with problems. Do you really need sub-microsecond timing? If so, then using a non-RTOS is going to cause you much grief. :(

The normal millisecond resolution (Note: it isn't millisecond accurate!) timer that the OS provides is usually perfectly fine. You'll probably want run times of at least a few seconds to wash out the normal jitter, and the longer the better.
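
On Linux, for example, wrapping the whole run with clock_gettime and a large iteration count is enough (a rough sketch; the loop body, the count, and where the timespec buffers live are placeholders):
Code:
CLOCK_MONOTONIC = 1

        mov eax, 228               ; sys_clock_gettime
        mov edi, CLOCK_MONOTONIC
        lea rsi, [t0]
        syscall

        mov rbp, 100000000         ; enough iterations for a run of several seconds
run_loop:
        ; ... variant A or variant B of the code path ...
        dec rbp
        jnz run_loop

        mov eax, 228
        mov edi, CLOCK_MONOTONIC
        lea rsi, [t1]
        syscall
        ; elapsed = (t1.sec - t0.sec) + (t1.nsec - t0.nsec)/1e9 seconds

t0 rq 2                            ; struct timespec { sec, nsec }
t1 rq 2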
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 12 May 2023, 12:51
Well, since I would like to evaluate the impact of small code paths, I would need a fairly big measurement data set to keep the effect from being hidden by the noise.

Namely, if I can "see" a significant difference in the coarse measurement data between the "coarse" and "coarse+fine" runs, it means the fine-grained probing has a significant impact and cannot be used.

We are going to end up using LHC-style maths and talking about 5 sigmas... but it is harder than the LHC here, because we have a huge variable to account for: the background load of the system.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20363
Location: In your JS exploiting you and your system
revolution 12 May 2023, 13:02
sylware wrote:
Well, since I would like to evaluate the impact of small code paths, I would need a fairly big measurement data set to keep the effect from being hidden by the noise.
But that is all part of it. If you can't see any difference between paths in the wall-clock runtimes then it really doesn't matter which path you choose, since they are equivalent.

Code paths don't run in a vacuum; they are part of the overall program, and effects happen in both directions. You might find the "perfect" micro-loop code but then kill your overall application performance because the cache is clobbered, or the write buffers are saturated, or the branch prediction table needs retraining, or any of 1000 other things. Only the overall performance should matter.
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 12 May 2023, 14:18
It means I will need a huge set of overall measurement data under different system loads.

I am already crying. LHC people don't realize how lucky they are.
bitRAKE



Joined: 21 Jul 2003
Posts: 4050
Location: vpcmpistri
bitRAKE 12 May 2023, 17:59
sylware wrote:
The point is not to be perfect but to have a general idea for the first implementation.
The first implementation should be as simple as possible. I only think about optimization on the first pass if I'm trying to force some specific instruction usage (e.g. for code size). Even if I'm working from reference research, implementing someone else's optimizations, I'm still only trying to implement their algorithm.

This will give you something to build on and serve as a comparative reference for changes.

Probably not the answer you're looking for.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
macgub



Joined: 11 Jan 2006
Posts: 348
Location: Poland
macgub 16 May 2023, 07:28
I really like discussions like that.

bitRAKE wrote:
It really depends on where the data is coming from and whether the load is predicted. A full unpredicted load from memory could shadow hundreds of operations.* Definitely something to test.


So in CPUs there is branch prediction and memory load prediction; what other kinds of prediction are there?
bitRAKE



Joined: 21 Jul 2003
Posts: 4050
Location: vpcmpistri
bitRAKE 16 May 2023, 16:02
Organizing it hierarchically might help clarify the relationships between the different types of prediction and speculation. Here is a possible way to categorize these concepts:
  1. Prefetching: Predicting which data or instructions will be needed next and loading them ahead of time.
    • Data Prefetching: Predicts which data will be needed next and loads it into the cache.
    • Instruction Prefetching: Predicts which instructions will be needed next and loads them into the instruction queue.

  2. Execution Strategies: Methods used to enhance the execution process of instructions.
    • Out-of-order Execution: Allows the CPU to execute instructions not in the sequential order of the program but based on the availability of input data and execution units.
    • Speculative Execution: The CPU predicts the outcome of instructions and executes subsequent instructions before the actual outcome is known.
    • Branch Prediction: Predicts the outcome of a conditional operation before it's completed.
      • Branch Target Buffer (BTB): Stores branch target addresses to aid in prediction.
      • Branch History Table (BHT): Stores the history of branch outcomes to improve prediction accuracy.
    • Memory Load/Store Prediction: Predicts the addresses of load and store operations.
    • Value Prediction: Predicts the result of an operation before it is computed.
    • Return Address Stack (RAS): Predicts the return address of a function, useful in deeply nested function calls.
    • Indirect Branch Prediction: Predicts the target of a pointer or a jump operation.

This hierarchy represents a high-level view of prediction and speculation methods in modern CPUs. It's important to note that the actual implementation can be more complex and may involve additional techniques not listed here.

For example, modern CPUs use multi-level cache hierarchies to balance access speed, capacity, and power consumption - which greatly complicates prefetching.

On the execution side there is also a multi-level aspect: modern x86 processors utilize a feature called a decoded instruction cache or micro-operation cache (µop cache). This feature is designed to bypass the relatively expensive x86 instruction decoding stage, which is a significant performance bottleneck due to the complexity of the x86 instruction set.

In a typical execution pipeline, instructions must be fetched, decoded into simpler micro-operations (µops), and then executed. The decoding stage can be quite complex, especially for x86 instructions, which have variable lengths and complex encoding.

The µop cache stores recently decoded instructions. When an instruction is fetched, the processor can first check the µop cache to see if the instruction's decoded µops are already available. If they are, the processor can skip the decoding stage for that instruction and directly fetch the µops from the cache, which saves time and power.

This is why the instruction architecture matters less at this performance level. Technologies needed below the decode are virtually the same. Modern processors are like supercomputers.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
macgub



Joined: 11 Jan 2006
Posts: 348
Location: Poland
macgub 17 May 2023, 06:47
bitRAKE
Thanks for detailed explanation...

The thread subject reminds me of an nvidia tutorial. The authors describe on-the-fly computation to reduce GPU-RAM transfers.
https://developer.nvidia.com/gpugems/gpugems3/part-i-geometry/chapter-5-generic-adaptive-mesh-refinement
bitRAKE



Joined: 21 Jul 2003
Posts: 4050
Location: vpcmpistri
bitRAKE 17 May 2023, 07:13
I'd like to see instructions for smart memory: Imagine clearing/moving memory being done by the memory. For example,

REP STOSNTQ - non-temporally write RCX qwords of RAX at RDI.

This instruction would have the main memory update the region without processor intervention. The memory controller could fake the transaction until it's complete by tracking the number of operations in flight. As memory cells become more complex, these types of operations wouldn't take that long to complete, and the processor wouldn't know the difference anyhow.

(Not unlike a DMA transfer, but probably different rules for error handling (i.e. none).)
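
The closest workaround today is probably a plain software loop of non-temporal stores, which at least keeps the fill data out of the cache hierarchy (a sketch of the existing workaround, not of the proposed instruction):
Code:
; rdi = destination, rcx = qword count, rax = fill value
fill_nt:
        movnti [rdi], rax          ; non-temporal store: write-combining, bypasses the caches
        add rdi, 8
        dec rcx
        jnz fill_nt
        sfence                     ; make the weakly-ordered NT stores globally visible before later stores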

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20363
Location: In your JS exploiting you and your system
revolution 17 May 2023, 08:26
Useful IMO would be content addressable memory (like the cache tags are). Then key lookups can be done in parallel by the memory, returning either {the first, or a list of, matched address(es)}, or {the value part of a key:value pair}.

But it is apparently expensive in silicon real estate, which is why only cache tags use it at the moment. There are probably some special ASICs that incorporate it as well.
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 17 May 2023, 10:20
I thought of something: I know we don't have a lot of registers, but why not restrict ourselves to loads/stores and register-only instructions? Kind of like RISC-V, but with half the registers.
