flat assembler
Message board for the users of flat assembler.
Main > L1 cache latency vs recompute
bitRAKE 10 May 2023, 14:21
Really depends on where it's coming from and whether it's a predicted load. A full unpredicted load from memory could shadow hundreds of operations.* Definitely something to test.
* This is assuming an ideal case where the memory request can be served immediately with no contention or other delays. In practice, contention, queuing delays, page table walks, and other factors can significantly increase the effective latency of a memory access.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
sylware 10 May 2023, 19:20
For instance, with value0 and value1 in the same cache line in L1, and value1 = (value0 + 0xff) & 0xffffffffffffff00:

Code:
mov rax, qword ptr [rel val0]
mov rcx, qword ptr [rel val1]

vs

Code:
mov rax, qword ptr [rel val0]
mov rcx, rax
add rcx, 0xff
and rcx, ~0xff

I am thinking about the L1 cache latencies in Agner's doc, namely the cost of a transfer from L1 to a register, because that transfer time could actually be significant compared to a few register-only simple instructions. Do we have some metrics on this transfer cost?

Additionally, if this is done everywhere, more could fit in the cache, which in the end means more data served from the cache and should favor overall speed.
revolution 10 May 2023, 19:42
Forget about using static timing values. They are useless without completely knowing the internal state of all instructions before and after the load. The load latency might, or might not, be completely irrelevant. It could be that a 20-cycle latency makes no difference for your code because some other instruction is causing delays.
Modern CPUs can have more than 100 instructions in flight, executing in opportunistic order as their dependencies become available.

When you have your full application working, you can try both of your proposed code sequences and see which performs better, on real-world data in the working app. No amount of synthetic testing or reading latency values can substitute for this.

That said, there are some base-level truths to be aware of. Caches are small, and it can be of great value to minimise the redundant data stored in them so as to maximise their effectiveness. But you can't know whether that will help in your particular application until you try each path and see the results.
bitRAKE 10 May 2023, 20:02
The code is larger and it needs to come from somewhere and be stored; there is a dependency chain which may or may not be mitigated by surrounding code, etc.
L1 has a 4-cycle latency, but the two loads can be done in parallel. The dependency chain is obviously longer out of context.
Furs 11 May 2023, 13:30
sylware wrote:
I am thinking about the L1 cache latencies in Agner's doc, namely the cost of a transfer from L1 to a register, because that transfer time could actually be significant compared to a few register-only simple instructions. Do we have some metrics on this transfer cost?

Of course, it's not that simple: while other operations may be 1 clock cycle, they're done in parallel, but on different units (ALUs, etc.) and ports than a memory read. So if those are full and bottlenecked, the arithmetic instructions will stall more.

As a general rule of thumb, though (just writing it initially without testing, and also because different CPUs are different and you may want to make only one "generic" binary that works reasonably well): favor arithmetic instructions if their latencies are small enough. Even if that is slower for that particular calculation, it still frees up L1 cache for other things that may be faster later down the line.

And try to use the smallest data types that still fit all the values you need. Do still do arithmetic on 32-bit registers, though, but store the values in smaller types (read them with movzx or movsx into a 32-bit register when needed, as sketched below).
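A minimal sketch of that last point (untested; the counts/limit names are made up for illustration, not from the thread): keep the data in the smallest type that fits so more of it packs into each cache line, and widen only at the point of use.

Code:
; data kept small so more of it fits per cache line
counts  db 0, 0, 0, 0        ; byte-sized counters
limit   dw 1000              ; word-sized limit

; widen on load; do the arithmetic in full 32-bit registers
movzx   eax, byte [counts + 2]   ; zero-extend an unsigned byte
movsx   ecx, word [limit]        ; sign-extend a signed word
add     eax, ecx
mov     [limit], ax              ; store back in the narrow type while the value still fits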
revolution 11 May 2023, 19:12
Furs wrote:
... Favor arithmetic instructions if their latencies are small enough.

But why only consider L1? There are many other units in a CPU; L1 is just one of them. If you try to write it all out and predict where, when and how an instruction flows through the CPU, you will fail, because there are also many other things happening that are unrelated to that one instruction, and all of those things compete for the same resources.

Actually, if one has to ask whether a 3-cycle L1 latency is "faster" or "slower" for this one instruction, it signals that the app is not yet ready for the optimisation stage, because it is easy to simply write it both ways and test it. So asking whether A will be faster than B is the wrong question. It is very unlikely it can be predicted by simply reading Agner Fog's tables. They tell you nothing about how an instruction executes within the environment of 100 other instructions, and all the data flows and cache states, and many other things that are unknowable until it is run on real data.

Don't guess. Measure.
bitRAKE 12 May 2023, 05:23
sylware wrote:
for instance, in the same cache line in L1 cache: ...

So, we try some different alternatives ...
Code:
mov ecx, 0xFF
mov rax, [rel val0]
add rcx, rax
and rcx, -0x100

mov rax, [rel val0]
lea rcx, [rax + 0xFF]
and rcx, -0x100

; R15 = 0xFF, code size almost the same in the loop
mov rax, [rel val0]
lea rcx, [rax + r15]
andn rcx, r15, rcx
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
sylware 12 May 2023, 10:54
The point is not to be perfect but to have a general idea for the first implementation; then, if we want to spend more time on it, to do some measurements...
Measurements: I would have to do coarse measurements in order to measure the impact of the fine measurements! And that with a few proper "real" system loads. Basically, per type of system load: first the coarse rdtsc-s alone, then coarse + fine rdtsc-s, to see how much the fine rdtsc-s impact the targeted code path. Then I can work on the fine rdtsc-s.

Yeah, I only have a Zen 2. Hopefully I'll find out some general ideas on which first implementation I should favor (or obvious Zen 2 performance issues).
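For the fine rdtsc-s, a minimal sketch of one common bracketing pattern (untested; the t0/t1 storage and the measured path are placeholders, not from the thread): put lfence in front of each rdtsc so the reads don't get reordered into or out of the measured region.

Code:
lfence                  ; keep earlier instructions from drifting past the read
rdtsc                   ; EDX:EAX = time-stamp counter
shl rdx, 32
or  rax, rdx
mov [t0], rax

; ... code path under test ...

lfence
rdtsc
shl rdx, 32
or  rax, rdx
mov [t1], rax           ; elapsed ticks = [t1] - [t0]
; note: the TSC runs at a fixed rate, not the current core clock, on modern CPUs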
revolution 12 May 2023, 11:08
Using RDTSC is fraught with problems. Do you really need sub-microsecond timing? If so, then using a non-RTOS is going to cause you much grief.
The normal millisecond-resolution timer that the OS provides (note: it isn't millisecond-accurate!) is usually perfectly fine. You'll probably want run times of at least a few seconds to wash out the normal jitter, and the longer the better.
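A minimal sketch of that coarse approach on x86-64 Linux (untested; run_workload, ts0/ts1 and the repeat count are made-up placeholders): read the OS clock once, repeat the workload long enough to run for seconds, and read it again.

Code:
CLOCK_MONOTONIC = 1

mov     eax, 228              ; sys_clock_gettime (x86-64 Linux)
mov     edi, CLOCK_MONOTONIC
lea     rsi, [ts0]            ; struct timespec { tv_sec, tv_nsec }
syscall

mov     ebx, 100000           ; repeat enough times to run for seconds
measure_loop:
call    run_workload
dec     ebx
jnz     measure_loop

mov     eax, 228
mov     edi, CLOCK_MONOTONIC
lea     rsi, [ts1]
syscall
; elapsed = (ts1.tv_sec - ts0.tv_sec) seconds plus the tv_nsec difference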
sylware 12 May 2023, 12:51
Well, since I would like to evaluate the impact of small code paths, I would need a fairly large amount of measurement data so the effect isn't hidden by the noise.
Namely, if I can "see" a significant difference in the coarse measurement data between the "coarse" and "coarse + fine" runs, it means the fine-grained measurement has a significant impact of its own and cannot be used. We are going to end up using the maths from the LHC and talking about 5 sigmas... but it is harder than the LHC here, because we have a huge variable to account for: the background load of the system.
revolution 12 May 2023, 13:02
sylware wrote:
Well, since I would like to evaluate the impact of small code paths, I would need a fairly large amount of measurement data so the effect isn't hidden by the noise.

Code paths don't run in a vacuum; they are part of the overall program, and effects happen in both directions. You might find the "perfect" micro-loop code but then kill your overall application performance because the cache is clobbered, or the write buffers are saturated, or the branch prediction table needs retraining, or any of 1000 other things. Only the overall performance should matter.
sylware 12 May 2023, 14:18
It means I will need a huge set of overall measurement data under different system loads.
I am already crying. LHC people don't realize how lucky they are.
bitRAKE 12 May 2023, 17:59
sylware wrote:
The point is not to be perfect but to have a general idea for the first implementation.

This will give you something to build on and serve as a comparative reference for changes. Probably not the answer you're looking for.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
macgub 16 May 2023, 07:28
I really like discussions like that.
bitRAKE wrote:
Really depends on where it's coming from and whether it's a predicted load. A full unpredicted load from memory could shadow hundreds of operations.* Definitely something to test.

So in CPUs there is branch prediction and memory load prediction; what other kinds of prediction are there?
bitRAKE 16 May 2023, 16:02
Organizing it hierarchically might help clarify the relationships between the different types of prediction and speculation. Here is a possible way to categorize these concepts:

- Control-flow prediction: branch direction, branch target, and return-address prediction
- Data-side prediction: hardware prefetching, and memory dependence prediction (guessing that a load does not alias an earlier store)
- Speculative execution: running ahead down the predicted path before the prediction is confirmed, with rollback on a mispredict
This hierarchy represents a high-level view of prediction and speculation methods in modern CPUs. It's important to note that the actual implementation can be more complex and may involve additional techniques not listed here. For example, modern CPUs use multi-level cache hierarchies to balance access speed, capacity, and power consumption, which greatly complicates prefetching.

On the execution side there is also a multi-level aspect: modern x86 processors have a decoded instruction cache, or micro-operation cache (µop cache). This feature is designed to bypass the relatively expensive x86 instruction decoding stage, which is a significant performance bottleneck due to the complexity of the x86 instruction set. In a typical execution pipeline, instructions must be fetched, decoded into simpler micro-operations (µops), and then executed. The decoding stage can be quite complex, especially for x86 instructions, which have variable lengths and complex encodings.

The µop cache stores recently decoded instructions. When an instruction is fetched, the processor first checks the µop cache to see if the instruction's decoded µops are already available. If they are, the processor can skip the decoding stage for that instruction and fetch the µops directly from the cache, which saves time and power.

This is why the instruction architecture matters less at this performance level; the technology needed below the decoder is virtually the same. Modern processors are like supercomputers.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
macgub 17 May 2023, 06:47
bitRAKE
Thanks for the detailed explanation... The thread subject reminds me of an Nvidia tutorial whose authors describe on-the-fly computation to reduce GPU-RAM transfers. https://developer.nvidia.com/gpugems/gpugems3/part-i-geometry/chapter-5-generic-adaptive-mesh-refinement
bitRAKE 17 May 2023, 07:13
I'd like to see instructions for smart memory: imagine clearing/moving memory being done by the memory itself. For example,

REP STOSNTQ - non-temporally write RCX qwords of RAX at RDI.

This instruction would have the main memory update the region without processor intervention. The memory controller could fake the transaction until it's complete by tracking the number of operations in flight. As memory cells become more complex, these types of operations wouldn't take that long to complete, and the processor wouldn't know the difference anyhow. (Not unlike a DMA transfer, but probably with different rules for error handling (i.e. none).)
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
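For comparison, a sketch of the closest thing available today (untested; the fill_nt label is made up): a fill loop with movnti, which writes from a general-purpose register while bypassing the caches, though the stores still issue from the core rather than being handed off to the memory.

Code:
; RDI = destination, RAX = qword pattern, RCX = qword count
fill_nt:
movnti  [rdi], rax      ; non-temporal store, does not pollute the caches
add     rdi, 8
dec     rcx
jnz     fill_nt
sfence                  ; make the weakly-ordered stores visible before reuse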
revolution 17 May 2023, 08:26
Useful IMO would be content addressable memory (like the cache tags are). Then key lookups can be done in parallel by the memory, returning either {the first, or a list of, matched address(es)}, or {the value part of a key:value pair}.
But apparently it is expensive in silicon real estate, which is why only cache tags use it at the moment. There are probably some special ASICs that incorporate it as well.
sylware 17 May 2023, 10:20
I thought of something: I know we don't have a lot of registers, but why not restrict ourselves to loads/stores and register-only instructions? Kind of like RISC-V, but with half the registers.
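Applied to the running example from earlier in the thread, that style would look something like this (just a sketch; storing the result back to val1 is my assumption):

Code:
mov rax, qword [rel val0]   ; load
lea rcx, [rax + 0xFF]       ; register-only compute
and rcx, -0x100
mov qword [rel val1], rcx   ; store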