flat assembler
Message board for the users of flat assembler.

multicore load/store ordering pitfalls

sylware



Joined: 23 Oct 2020
Posts: 462
Location: Marseille/France
sylware 04 Nov 2022, 13:33
This is in the context of write-back (WB) cached memory.

On a single core, a load can be performed before an earlier store to a different location (even more so with speculative execution).

But in a multicore scenario, which cases do I need to be very careful about?

For instance, since cores can share some cache, how do I push stores stuck in one core's write buffers out so that the other cores can "see" them?
mfence/sfence and other "serializing instructions" seem "violent", as they push everything down to DRAM, when pushing down to the first level of cache shared between the cores should be enough.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 04 Nov 2022, 13:51
All the magic for cache coherency is done by the MESI protocol.

You don't have to worry about it. You have no control over it anyway. And it works just fine.
sylware



Joined: 23 Oct 2020
Posts: 462
Location: Marseille/France
sylware 04 Nov 2022, 20:36
Interesting.

I had a look at this Wikipedia article, and there is a store buffer section.

There, they say that a memory barrier flushes the store buffer only as far as the cache, but I recall reading in an AMD datasheet on memory ordering that memory barriers flush stores all the way to DRAM.

Am I delusional? Unable to read a datasheet?
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 04 Nov 2022, 22:30
Depending upon the CPU and system configuration, a memory barrier instruction might force all data to be coherent all the way to the DRAM (very slow), or just to the highest level cache (less slow). It really doesn't matter though, as long as the CPU delivers a consistent state to all the cores.

Using a memory barrier in user code is very rare, and usually not needed. Most uses are to synchronise and order data for external peripherals.
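
One case where a barrier does matter in ordinary user code on x86 is a store followed by a load from a different location: the store can still be sitting in the store buffer when the load completes. Below is a minimal sketch of that "store buffering" test in C11. The sb_* names are made up for illustration, and pthreads is assumed to be available. With the seq_cst fence in both threads, the outcome where both loads return 0 is forbidden by the C11 memory model.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int sb_x, sb_y;   /* shared locations, reset to 0 each round */
static int sb_r1, sb_r2;        /* value each thread loaded */

static void *sb_thread_a(void *arg) {
    (void)arg;
    atomic_store_explicit(&sb_x, 1, memory_order_relaxed);
    /* Full fence: orders the store above before the load below.
       On x86, compilers typically emit mfence or a locked instruction here. */
    atomic_thread_fence(memory_order_seq_cst);
    sb_r1 = atomic_load_explicit(&sb_y, memory_order_relaxed);
    return NULL;
}

static void *sb_thread_b(void *arg) {
    (void)arg;
    atomic_store_explicit(&sb_y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    sb_r2 = atomic_load_explicit(&sb_x, memory_order_relaxed);
    return NULL;
}

/* Runs the test `iterations` times and returns how often BOTH threads
   loaded 0, i.e. neither saw the other's store. */
int sb_count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; i++) {
        pthread_t a, b;
        int rc;
        atomic_store(&sb_x, 0);
        atomic_store(&sb_y, 0);
        rc = pthread_create(&a, NULL, sb_thread_a, NULL);
        assert(rc == 0);
        rc = pthread_create(&b, NULL, sb_thread_b, NULL);
        assert(rc == 0);
        (void)rc;
        pthread_join(a, NULL);   /* join makes sb_r1/sb_r2 safe to read */
        pthread_join(b, NULL);
        if (sb_r1 == 0 && sb_r2 == 0)
            both_zero++;
    }
    return both_zero;
}
```

Calling sb_count_both_zero(200) should return 0. If the two fences are removed, a non-zero count becomes possible on real x86 hardware, though it is not guaranteed to show up in any given run.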
Furs



Joined: 04 Mar 2016
Posts: 2565
Furs 05 Nov 2022, 15:01
Note that this is mostly true for x86 CPUs since they have strong memory ordering. It's not the same on other CPU architectures where you need to add barriers in a lot of cases with multi-threading.
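
To illustrate the difference, here is a rough sketch of the usual "publish data, then set a flag" pattern in portable C11. The mp_* names are invented for the example and pthreads is assumed. On x86 a release store and an acquire load typically compile to plain mov instructions, while on weakly ordered architectures the compiler has to emit barrier or acquire/release instructions for the same source.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int mp_payload;        /* plain data, published through mp_ready */
static atomic_int mp_ready;   /* 0 = not published yet, 1 = published */

static void *mp_producer(void *arg) {
    (void)arg;
    mp_payload = 42;
    /* Release: the payload write may not be reordered after this store. */
    atomic_store_explicit(&mp_ready, 1, memory_order_release);
    return NULL;
}

static void *mp_consumer(void *arg) {
    int *out = arg;
    /* Acquire: once the flag reads 1, the payload write is visible too. */
    while (atomic_load_explicit(&mp_ready, memory_order_acquire) == 0) {
        /* spin until the producer publishes */
    }
    *out = mp_payload;
    return NULL;
}

/* Runs one producer and one consumer; returns what the consumer read,
   or -1 if a thread could not be created. */
int mp_run(void) {
    pthread_t p, c;
    int seen = -1;
    mp_payload = 0;
    atomic_store(&mp_ready, 0);
    if (pthread_create(&p, NULL, mp_producer, NULL) != 0)
        return -1;
    if (pthread_create(&c, NULL, mp_consumer, &seen) != 0) {
        pthread_join(p, NULL);
        return -1;
    }
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

mp_run() should return 42 on any conforming implementation. If both accesses to mp_ready were memory_order_relaxed, the program would contain a data race on mp_payload, and a weakly ordered CPU could let the consumer read a stale value.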
sylware



Joined: 23 Oct 2020
Posts: 462
Location: Marseille/France
sylware 05 Nov 2022, 17:14
As far as I know, RISC-V is one of those architectures where you have to program like this.

That said, on x86_64 iret is a serializing instruction, and a locked xchg acts as a full memory barrier too, so it sort of happens all the time on x86_64.
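
As an illustration of how often a locked xchg shows up, here is a small sketch of a spinlock built on C11 atomic_exchange. The demo_* names are hypothetical. On x86, compilers typically implement the exchange with xchg on a memory operand, which is implicitly locked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int held;   /* 0 = free, 1 = taken */
} demo_spinlock;

/* One attempt to take the lock; returns true on success. */
bool demo_try_lock(demo_spinlock *l) {
    /* The exchange returns the previous value: 0 means the lock was free
       and is now ours. */
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

/* Spins until the lock is taken. */
void demo_lock(demo_spinlock *l) {
    while (!demo_try_lock(l)) {
        /* busy-wait */
    }
}

/* Releases the lock; stores made while holding it become visible to the
   next thread that acquires it. */
void demo_unlock(demo_spinlock *l) {
    assert(atomic_load_explicit(&l->held, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

A lock declared as `demo_spinlock l = { 0 };` starts out free: the first demo_try_lock(&l) succeeds, a second one fails until demo_unlock(&l) is called.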
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
