flat assembler
Message board for the users of flat assembler.
> Main > L1 cache latency vs recompute
bitRAKE 17 May 2023, 16:44
revolution wrote:
Useful IMO would be content addressable memory (like the cache tags are). Then key lookups can be done in parallel by the memory, returning either {the first, or a list of, matched address(es)}, or {the value part of a key:value pair}.

My suggestion was to potentially provide a way to alleviate pressure on the cache hierarchy and improve overall system performance. Something that could be implemented simply at first with changes in the integrated memory controller, and then migrate and improve along with memory technologies.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
sylware 17 May 2023, 18:36
@Furs, I have a hard time picturing a fused µop combining a memory load with arithmetic. I would understand a fused µop for several arithmetic instructions (for instance multiply-add).
@bitRAKE, do you know of any serious attempt at a multi-core processor with only cache memory? (We need a bit of latency to arbitrate the accesses among cores, so maybe only 2 levels of cache memory.)
bitRAKE 17 May 2023, 19:26
I'm not familiar with the research on cache-only memory architecture (COMA). It looks to have been quite active in the 1990s. Non-locality is just so expensive as the number of cores increases. Historically, I think many of these concepts were explored with supercomputers - everything from no common memory to all common memory - and we got to the present approach through trying to program in those environments.
Not that there isn't room for innovation. Just that we should respect the past and ask what has changed. Does the rise of languages like Rust make the added complexity of programming a COMA machine more advantageous? How does this couple with a heterogeneous environment of specialized cores?
sylware 17 May 2023, 21:22
@bitRAKE: a lot has changed; the hardware is much, MUCH faster at the silicon level. It is all about wafer real estate: how much cache memory can we reasonably cram there? 8GB? 16GB? 2 or 3 levels (or 1!), depending on the number of cores.
Compiled languages will become more and more obsolete as the RISC-V ISA - a worldwide, royalty-free standard - gains traction and core components/very-high-level language interpreters are ported to human-written assembly. Then planned obsolescence will mostly come from critical applications written in very-high-level languages. Of course, this won't happen if RISC-V is a failure. Rust is already obsolete, but it is a good defence against the C++ zealots since, it seems, Rust's syntax is orders of magnitude simpler than C++'s.
revolution 18 May 2023, 00:22
A super-large cache memory is probably not of much value. Larger memories are slower, so past a certain size the cache becomes too slow and you lose the main benefit. Plus the expense of creating all that SRAM would make the CPU too costly, in both dollar and power budgets. There isn't much of a win with large caches.
sylware 18 May 2023, 11:05
@revolution this is exactly why I am curious: what are those numbers on modern silicon?
There is also this: with 8GB or 16GB on the CPU die, no more DIMM modules. And that is surely a win.
revolution 18 May 2023, 11:29
8GB in the chip package will be DRAM. Not really suitable for cache.
It can probably perform some operations better than motherboard-mounted DRAM for two main reasons: 1) it is physically closer and the connections can be optimised, and 2) it can have a wider data bus, giving more bits per transfer. Many GPUs include these kinds of HBM modules. But the downside is the lack of an upgrade path: you are stuck with what you get. And in general DRAM has poor latency compared to SRAM. You can get good bandwidth from DRAM, but only on predictable bulk transfers; for random access it is awful.
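The random-access penalty can be illustrated with a dependent pointer-chase loop (a hypothetical sketch, not from the thread; it assumes rax points at a linked list whose first field is the next pointer and rcx holds a hop count):
Code:
; Dependent chase: each load must complete before the next can even
; start, so every iteration pays the full memory latency (hundreds of
; cycles on a DRAM miss). A streaming loop over the same data would
; instead overlap many independent loads and run at full bandwidth.
chase:
    mov  rax, [rax]     ; next = node->next
    dec  rcx            ; rcx = remaining hops
    jnz  chase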
Furs 18 May 2023, 14:19
sylware wrote:
@furs, I have a hard time thinking about a fused µop between a memory load and arithmetic. I would understand a fused µop for several arithmetic instructions (for instance multiply-add).

This shows cmp with a memory operand can both macro- and micro-fuse: https://stackoverflow.com/questions/56413517/what-is-instruction-fusion-in-contemporary-x86-processors
Also see here: https://news.ycombinator.com/item?id=16306359:
BeeOnRope wrote:
4) The more interesting case of "complex" fused instructions is probably load-op, which only generate 1 (fused) uop so are eligible for the fast 1-1-1-1 decode treatment.
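For reference, a minimal sketch of the case discussed in those links (the counter/.retry labels are hypothetical): on recent Intel cores a cmp with a memory source micro-fuses its load with the compare, and the cmp+jcc pair is additionally a macro-fusion candidate.
Code:
; The whole "load, compare, branch" sequence can travel the pipeline
; as a single fused µop: the load micro-fuses with the cmp, and the
; cmp macro-fuses with the adjacent conditional jump.
    cmp  rax, [counter]   ; compare register against value in memory
    jnz  .retry           ; fused with the cmp on supporting cores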
sylware 19 May 2023, 16:06
@revolution with an Apple M-series die size, how much cache memory could I fit alongside a few cores? That on a modern silicon process.
@furs, yeah, it seems some loads do get fused with their consuming operations: as soon as the data is loaded, the actual operation completes in one clock. But the author says that gcc is starting to prefer a load followed by register-only instructions: this is what I was talking about. So things are happening here, hence my question.
bitRAKE 19 May 2023, 19:04
The AMD 3D V-Cache packages only add 64MB of L3 cache, and that covers half of an eight-core CCD die. 8GB of SRAM would need 128 times the area/power. Maybe in some distant future? [1]
revolution 19 May 2023, 21:20
With 128 x 64MB it doesn't act like a cache anymore, because it would be very slow. The main goal of a cache is to be fast memory, so it necessarily has to be small.
sylware 19 May 2023, 21:56
Would, would, would.
Did anybody actually try? Or at least publish some real, detailed numbers?
revolution 19 May 2023, 21:58
Nobody has tried, because it is impossible to build with the tech we have now.
But the laws of physics dictate the outcome. Ya cannae break the laws of physics.
sylware 20 May 2023, 09:22
In this context, it would remove the need for memory modules, and that would surely "bother" some companies. I believe this is the main reason why it was "not tried". I will be pleased to be proven wrong.
Furs 20 May 2023, 19:56
sylware wrote:
But the author says that gcc is starting to prefer to load then use regs only instructions: this is what I was talking about. So, things are happening here, thus my question.

The point is that splitting a memory-operand instruction into a separate load plus one register-only operation, a la RISC, is useless and harms performance (and adds code bloat). If you're going to operate on the same memory data for several instructions in a row, then of course it's better to load it into a register first. Just don't load it if you're only going to use it once from that register.
e.g. bad:
Code:
mov eax, [var]
add ebx, eax
Code:
add ebx, [var]
sub ecx, [var]
or edx, [var]
good:
Code:
add ebx, [var]
Code:
mov eax, [var]
add ebx, eax
sub ecx, eax
or edx, eax
sylware 21 May 2023, 10:22
What about this one (power-of-2 alignment):
Code:
add rax, [mask_lo]
and rax, [mask_hi]
or
Code:
mov rcx, [mask_lo]
add rax, rcx
not rcx
and rax, rcx
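For concreteness, here is a hypothetical pair of masks for 64-byte alignment (the value 64 is an assumption, not from the thread). Both sequences compute rax = (rax + 63) and not 63, i.e. they round rax up to the next 64-byte boundary:
Code:
align 8
mask_lo dq 63        ; ALIGN - 1       (low bits set)
mask_hi dq not 63    ; not (ALIGN - 1) (clears the low bits)
; e.g. rax = 100 -> (100 + 63) and not 63 = 128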
Roman 21 May 2023, 10:42
Code:
mov rcx, [mask_lo]  ; a modern CPU executes the 3 commands below in one tick
add rax, rcx
not rcx
and rax, rcx
revolution 21 May 2023, 11:15
sylware: Test them. We don't know how it will run in your code on your systems.
sylware 21 May 2023, 18:33
@Roman
Code:
mov rcx, [mask_lo]
add rax, rcx
andn rax, rcx, rax
And that add+andn pair could be fused into an add_andn rax, rcx µop, but I doubt it. I don't know why, but the fewer memory accesses I do, the better I feel. But if internally mask_hi was loaded into a hidden register during the same ticks as mask_lo (since it follows mask_lo), this is one tick too many unless there is an add+andn µop fusion. And I would not be surprised to see a full cache line (or a good chunk of it) speculatively loaded into hidden registers.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.