flat assembler
Message board for the users of flat assembler.

Index > Main > L1 cache latency vs recompute

Furs



Joined: 04 Mar 2016
Posts: 2521
Furs 17 May 2023, 13:17
sylware wrote:
I thought of something: I know we don't have a lot of registers, but why not restrict ourselves to loads/stores and register-only instructions? Kind of like RISC-V, but with half the registers.
Because then you get more decoder overhead (more instructions for the same work), more code bloat, worse use of the instruction cache, etc. Also, memory loads combined with arithmetic instructions are sometimes fused, with special micro-ops for them, but it really depends on the CPU.
bitRAKE



Joined: 21 Jul 2003
Posts: 4046
Location: vpcmpistri
bitRAKE 17 May 2023, 16:44
revolution wrote:
Useful, IMO, would be content-addressable memory (like the cache tags are). Then key lookups can be done in parallel by the memory itself, returning either the first matched address (or a list of them), or the value part of a key:value pair.
That seems like the endgame for some scenarios in the future.

My suggestion was to potentially provide a way to alleviate pressure on the cache hierarchy and improve overall system performance; something that could be implemented simply at first with changes in the integrated memory controller, and then improved along with memory technologies.
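The CAM behaviour described above can be modelled in a few lines. This is a toy sketch (the class and method names are my own, not any real hardware interface), where the "parallel" tag comparison that real hardware performs in one step is only simulated with a loop:

```python
# Toy model of a content-addressable memory (CAM): lookups search by
# key, not by address. Real hardware compares every tag line at once;
# the loops below only simulate that parallelism.
class ToyCAM:
    def __init__(self, size):
        self.keys = [None] * size
        self.values = [None] * size

    def write(self, addr, key, value):
        self.keys[addr] = key
        self.values[addr] = value

    def match_all(self, key):
        # all addresses whose stored key equals the search key
        return [a for a, k in enumerate(self.keys) if k == key]

    def lookup(self, key):
        # value of the first matching key:value pair, or None
        for a, k in enumerate(self.keys):
            if k == key:
                return self.values[a]
        return None

cam = ToyCAM(8)
cam.write(0, key=42, value="foo")
cam.write(5, key=42, value="bar")
print(cam.match_all(42))   # [0, 5]
print(cam.lookup(42))      # foo
```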

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 17 May 2023, 18:36
@Furs, I have a hard time imagining a fused µop combining a memory load and arithmetic. I could understand a fused µop for several arithmetic instructions (for instance, multiply-add).

@bitRAKE, do you know of any serious attempt at a multi-core processor with only cache memory? (We need a bit of latency to arbitrate accesses among the cores, so maybe only two levels of cache.)
bitRAKE



Joined: 21 Jul 2003
Posts: 4046
Location: vpcmpistri
bitRAKE 17 May 2023, 19:26
I'm not familiar with the research on cache-only memory architecture (COMA). It looks to have been quite active in the '90s. Non-locality is just so expensive as the number of cores increases. Historically, I think many of these concepts were explored with supercomputers, everything from no common memory to all common memory, and we got to the present approach through trying to program in those environments.

Not that there isn't room for innovation. Just that we should respect the past and ask what has changed.

Does the rise of languages like Rust make the added complexity of programming a COMA more advantageous? How does this couple with a heterogeneous environment of specialized cores?
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 17 May 2023, 21:22
@bitRAKE: a lot has changed; the hardware is much, MUCH faster at the silicon level. It is all about wafer real estate: how much cache memory can we reasonably cram there? 8GB? 16GB? Two or three levels (or one!), depending on the number of cores.

Compiled languages will become more and more obsolete as the RISC-V ISA, a worldwide royalty-free standard, gains traction and core components and very-high-level language interpreters are ported to human-written assembly. Then planned obsolescence will come mostly from critical applications written in very-high-level languages. Of course, this won't happen if RISC-V is a failure.

Rust is already obsolete, but it is a good defence against the C++ zealots since, it seems, Rust syntax is orders of magnitude simpler than C++'s.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20361
Location: In your JS exploiting you and your system
revolution 18 May 2023, 00:22
Having a super-large cache is probably not worth much. Larger memories are slower, so beyond a certain size the cache becomes too slow and you lose its main benefit. Plus, the expense of creating all that SRAM would make the CPU too costly (in both dollar and power budgets). There isn't much of a win with large caches.
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 18 May 2023, 11:05
@revolution, this is exactly why I am curious: what are those numbers on modern silicon?

There is also this: with 8GB or 16GB on the CPU die, no more DIMM modules. And that is surely a win.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20361
Location: In your JS exploiting you and your system
revolution 18 May 2023, 11:29
8GB in the chip package would be DRAM, not really suitable for cache.

It can probably perform some operations better than motherboard-mounted DRAM for two main reasons: 1) it is physically closer, so the connections can be optimised, and 2) it can have a wider data bus, giving more bits per transfer. Many GPUs include HBM modules of this kind.

But the downside is the lack of an upgrade path: you are stuck with what you get. And in general, DRAM has poor latency compared to SRAM. You can get good bandwidth from DRAM, but only on predictable bulk transfers; for random access it is awful.
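The predictable-bulk versus random-access gap can be demonstrated even from a high-level language. A minimal sketch: walk the same array once sequentially and once through a shuffled pointer chain, where every step is a dependent, unpredictable load. (Illustrative only; in Python the interpreter overhead blurs the hardware effect, so treat this as methodology, not as hard numbers.)

```python
import random
import time

N = 1 << 20
# next[i] = i + 1: the predictable, prefetch-friendly walk
seq_next = list(range(1, N)) + [0]

# a random permutation arranged as one big cycle: every step is a
# dependent load at an unpredictable address
perm = list(range(N))
random.shuffle(perm)
rand_next = [0] * N
for i in range(N - 1):
    rand_next[perm[i]] = perm[i + 1]
rand_next[perm[-1]] = perm[0]

def chase(nxt, steps):
    # follow the chain; each iteration depends on the previous load
    i = 0
    for _ in range(steps):
        i = nxt[i]
    return i

for name, nxt in (("sequential", seq_next), ("random", rand_next)):
    t0 = time.perf_counter()
    chase(nxt, N)
    print(f"{name}: {time.perf_counter() - t0:.3f}s")
```

Both walks visit every element exactly once and return to the start; only the access pattern differs.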
Furs



Joined: 04 Mar 2016
Posts: 2521
Furs 18 May 2023, 14:19
sylware wrote:
@Furs, I have a hard time imagining a fused µop combining a memory load and arithmetic. I could understand a fused µop for several arithmetic instructions (for instance, multiply-add).
I mean you can just search.

This shows that cmp with a memory operand can both macro- and micro-fuse: https://stackoverflow.com/questions/56413517/what-is-instruction-fusion-in-contemporary-x86-processors

Also see here: https://news.ycombinator.com/item?id=16306359:
BeeOnRope wrote:
4) The more interesting case of "complex" fused instructions is probably load-op, which only generate 1 (fused) uop so are eligible for the fast 1-1-1-1 decode treatment.

There is pretty much no downside[1] to using "complex" memory destination RMW and memory source load-op instructions in modern x86, and all decent compilers generate them when they can.
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 19 May 2023, 16:06
@revolution, with an Apple M-series-sized die, how much cache memory could I fit alongside a few cores, on a modern silicon process?

@Furs, yeah, it seems some loads do get fused with their main operation: as soon as the data is loaded, the actual operation completes in one clock.

But the author says GCC is starting to prefer a load followed by register-only instructions: this is what I was talking about. So things are happening here, hence my question.
bitRAKE



Joined: 21 Jul 2003
Posts: 4046
Location: vpcmpistri
bitRAKE 19 May 2023, 19:04
The AMD 3D V-Cache packages only add 64MB of L3 cache, and that covers half of an eight-core CCD die. 8GB of SRAM would need 128 times the area/power. Maybe in some distant future? [1]

revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20361
Location: In your JS exploiting you and your system
revolution 19 May 2023, 21:20
With 128 x 64MB it wouldn't act like a cache anymore, because it would be very slow. The main goal of a cache is to be fast memory, so it necessarily has to be small.
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 19 May 2023, 21:56
Would, would, would.

Did somebody actually try?

Or, at least, did anybody publish some real, detailed numbers?
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20361
Location: In your JS exploiting you and your system
revolution 19 May 2023, 21:58
Nobody tried because it is impossible to make with the tech we have now.

But the laws of physics dictate the outcome. Ya cannae break the laws of physics.
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 20 May 2023, 09:22
In this context, it would remove the need for memory modules, and that would surely "bother" some companies. I believe this is the main reason why it was "not tried". I will be pleased to be proven wrong.
Furs



Joined: 04 Mar 2016
Posts: 2521
Furs 20 May 2023, 19:56
sylware wrote:
But the author says GCC is starting to prefer a load followed by register-only instructions: this is what I was talking about. So things are happening here, hence my question.
Using register-only operands is still faster than using a memory operand.

The point is that splitting a memory-operand instruction into a separate load plus a register-only operation, RISC-style, is useless: it harms performance and adds code bloat.

If you're going to operate on the same memory data in several instructions in a row, then of course it's better to load it into a register first. Just don't do the load if you're only going to use the value once.

e.g.

bad (separate load for a single use):
Code:
mov eax, [var]
add ebx, eax

bad (reads the same memory operand three times):
Code:
add ebx, [var]
sub ecx, [var]
or edx, [var]

good (single use, memory operand):
Code:
add ebx, [var]

good (load once, then reuse the register):
Code:
mov eax, [var]
add ebx, eax
sub ecx, eax
or edx, eax
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 21 May 2023, 10:22
What about this one (power-of-2 alignment)?
Code:
add rax,[mask_lo]
and rax,[mask_hi]

versus
Code:
mov rcx,[mask_lo]
add rax,rcx
not rcx
and rax,rcx
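For reference, both variants compute the classic power-of-2 align-up, assuming mask_lo = align-1 and mask_hi = ~mask_lo (my reading of the names; the thread doesn't define them). A quick check in Python, with 64-bit register wrap-around made explicit:

```python
M64 = (1 << 64) - 1  # model 64-bit register wrap-around

def align_up_two_loads(x, align):
    # add rax,[mask_lo] / and rax,[mask_hi]
    mask_lo = align - 1
    mask_hi = ~mask_lo & M64
    return ((x + mask_lo) & M64) & mask_hi

def align_up_one_load(x, align):
    # mov rcx,[mask_lo] / add rax,rcx / not rcx / and rax,rcx
    rcx = align - 1
    rax = (x + rcx) & M64
    rcx = ~rcx & M64
    return rax & rcx

# the two sequences agree for any input
for x in (0, 1, 13, 16, 4097):
    assert align_up_two_loads(x, 16) == align_up_one_load(x, 16)
print(align_up_two_loads(13, 16))   # 16
```

The second variant trades the extra memory access for a not, which is the trade-off discussed below.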
Roman



Joined: 21 Apr 2012
Posts: 1808
Roman 21 May 2023, 10:42
Code:
mov rcx,[mask_lo]
; a modern CPU can execute these three register-only instructions in one tick
add rax,rcx
not rcx
and rax,rcx
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20361
Location: In your JS exploiting you and your system
revolution 21 May 2023, 11:15
sylware: Test them. We don't know how it will run in your code on your systems.
sylware



Joined: 23 Oct 2020
Posts: 441
Location: Marseille/France
sylware 21 May 2023, 18:33
@Roman

Code:
mov rcx,[mask_lo]
add rax,rcx
andn rax,rcx,rax


The add+andn pair could be fused into an add_andn rax,rcx µop, but I doubt it.

I don't know why, but the fewer memory accesses I make, the better I feel.

But if, internally, mask_hi were loaded into a hidden register during the same ticks as mask_lo (since it follows mask_lo), then this is one tick too many unless there is an add+andn µop fusion. And I would not be surprised to see a full cache line (or a good chunk of it) speculatively loaded into hidden registers.
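BMI1's andn computes dest = ~src1 & src2, which is exactly what replaces the not+and pair above. A small model to check that the three-instruction sequence still aligns correctly (assuming, as before, that mask_lo holds align-1):

```python
M64 = (1 << 64) - 1  # model 64-bit register wrap-around

def andn(src1, src2):
    # BMI1 ANDN semantics: dest = ~src1 & src2, truncated to 64 bits
    return ~src1 & src2 & M64

def align_up_andn(rax, mask_lo):
    # mov rcx,[mask_lo] / add rax,rcx / andn rax,rcx,rax
    rcx = mask_lo
    rax = (rax + rcx) & M64
    return andn(rcx, rax)

print(align_up_andn(13, 15))   # 16
print(align_up_andn(32, 15))   # 32
```

Note the operand order: in andn rax,rcx,rax it is rcx that gets inverted, so the single loaded mask serves as both the round-up addend and (inverted) the alignment mask.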

Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.