flat assembler
Message board for the users of flat assembler.

vivik



Joined: 29 Oct 2016
Posts: 379

memory fence, lockless

I'm reading this: Lockless Programming Considerations for Xbox 360 and Microsoft Windows.

I only care about windows right now.

When you call MemoryBarrier(), do you still need _ReadWriteBarrier(), _ReadBarrier(), or _WriteBarrier()? I can't get any of the latter three to compile. Which header declares them?

This is a really scary technique: it has the potential for bugs that appear once per 100000 runs, or bugs that appear only on some CPUs (I made the latter up, I have no basis for it). I need to get it right.
Post 12 Mar 2018, 21:19
donn



Joined: 05 Mar 2010
Posts: 100

I'm also curious what the answer to your question is. Just some thoughts of mine, though:

"These instructions also ensure that the compiler disables any optimizations that could reorder memory operations across the barriers."

- source. MemoryBarrier prevents both CPU and compiler level reordering, so it is stronger than the others. This was also discussed here.
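To make the compiler-versus-CPU distinction concrete, here is a minimal sketch in portable C11 rather than MSVC intrinsics. It assumes that atomic_signal_fence plays roughly the role of _ReadWriteBarrier() (compiler only, no instruction emitted) and that atomic_thread_fence plays roughly the role of MemoryBarrier() (compiler and CPU). The one-slot mailbox and the names publish and try_consume are made up for illustration. As far as I know, the MSVC intrinsics _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier are declared in <intrin.h>, while MemoryBarrier comes in through <windows.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t    payload;  /* plain data, guarded by the flag below */
static atomic_uint ready;    /* 0 = slot empty, 1 = payload is valid  */

/* Compiler-only barrier: stops the compiler from moving memory accesses
   across it, but orders nothing at the CPU level. Shown for contrast. */
void compiler_barrier(void) { atomic_signal_fence(memory_order_seq_cst); }

/* Full barrier: constrains both the compiler and the CPU. */
void full_barrier(void) { atomic_thread_fence(memory_order_seq_cst); }

/* Producer side: returns false if the previous value was not consumed yet. */
bool publish(uint32_t value) {
    if (atomic_load_explicit(&ready, memory_order_relaxed) != 0)
        return false;
    full_barrier();          /* flag check is ordered before the payload write */
    payload = value;
    full_barrier();          /* payload is visible before the flag is set */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return true;
}

/* Consumer side: returns false if nothing has been published. */
bool try_consume(uint32_t *out) {
    assert(out != NULL);
    if (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        return false;
    full_barrier();          /* flag is read before the payload */
    *out = payload;
    full_barrier();          /* payload read finishes before the slot is released */
    atomic_store_explicit(&ready, 0, memory_order_relaxed);
    return true;
}
```

The full barrier already includes the compiler-level effect, so in this sketch there is no place where a separate compiler barrier would be needed on top of it.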

I agree, it's scary stuff, especially since there are many multi-processing platforms in use today. I'm newer to asm-level CPU multithreading (more experienced with GPU-only atomics), so you may want a second or third opinion here. Reading about the SSE2 mfence instruction, xchg, xadd, and so on at the asm level may clear things up too. If you narrow the scope down to what you're trying to achieve, it may be even safer. The Xbox reordering rules, for example, are different from x86, and if you can manage the performance hit, locks are far safer. Distinguishing between atomic and composite operations may also help, but the article you referenced shows that reordering is much worse on Xbox:
"Even though x86 and x64 CPUs do reorder instructions, they generally do not reorder write operations relative to other writes. "
Post 13 Mar 2018, 00:27
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15727
Location: (514107) 2015 BZ509

Moved to HLL forum.
Post 13 Mar 2018, 01:29
vivik



Joined: 29 Oct 2016
Posts: 379

Hm, I will set memory fence aside for now, need to make an app that doesn't burn cpu unnecessarily...

What is the proper way to suspend a thread? My plan right now is:

1) At first, CreateThread with CREATE_SUSPENDED

2) When I need thread to work, I call ResumeThread

3) When I need thread to stop, I call SwitchToThread

Am I wrong? Maybe it's better to just exit the thread, instead of reusing it? Is it ok to call ResumeThread while it's still working?

I want to switch jpg decompression into a separate thread, so that gui will lag a bit less.

Also, I know about "minimize the data sharing between threads", I'm still figuring it out. I will use separate heaps or something.
Post 14 Mar 2018, 08:58
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15727
Location: (514107) 2015 BZ509

If you want to pause a thread then use any of the Wait* functions. Or just Sleep. It is usually best if threads handle their own runtime.
Post 14 Mar 2018, 10:41
vivik



Joined: 29 Oct 2016
Posts: 379

You mean "WaitForSingleObject" and such?
Post 14 Mar 2018, 14:02
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15727
Location: (514107) 2015 BZ509

Yes.
Post 14 Mar 2018, 14:15
vivik



Joined: 29 Oct 2016
Posts: 379

Hm, SwitchToThread isn't just "Sleep", more on that here, answer by Maxim Masiutin

https://stackoverflow.com/questions/1383943/switchtothread-vs-sleep1

I probably won't use either Sleep or SwitchToThread. What I was planning to do is called polling, and as those links say, it's like "checking your watch every minute to see if it's 3 o'clock yet instead of just setting an alarm". I guess polling is useful if you are sure the second thread will run something like 75% of the time anyway and a quick response is necessary.

https://blogs.msdn.microsoft.com/oldnewthing/20090727-00/?p=17353
https://blogs.msdn.microsoft.com/oldnewthing/20060124-17/?p=32553

Nice overview of what I can possibly do on windows:

>If you just want to make the thread wait for a fixed amount of time, use Sleep() for that. But if you want the wait to be interruptable by some external operation, you have to wait on something, whether that be a timer (see SetTimer() and Get/PeekMessage()), an event (see CreateEvent() and WaitForSingleObject()), a message in the thread's queue (see PostThreadMessage() and MsgWaitForMultipleObjects()), an I/O Completion Port callback (see PostQueuedCompletionStatus() and GetQueuedCompletionStatus()), etc.

https://stackoverflow.com/questions/12489234/wait-for-a-specific-time-in-thread-use-waitforsingleobject

So, it's either CreateEvent + WaitForSingleObject, or PostThreadMessage + MsgWaitForMultipleObjects. I'm not sure what they do. Is it possible for each thread to have its own message loop or something?

By the way, there is some odd warning against WaitForSingleObject, something about deadlock if you use COM and multiple threads and something, I'm not sure if this applies to me.

https://marc.durdin.net/2012/08/waitforsingleobject-why-you-should-never-use-it/
Post 14 Mar 2018, 15:20
DimonSoft



Joined: 03 Mar 2010
Posts: 231
Location: Belarus


vivik wrote:
So, it's either CreateEvent + WaitForSingleObject, or PostThreadMessage + MsgWaitForMultipleObjects. I'm not sure what they do. Is it possible for each thread to have its own message loop or something?


Yes, each thread gets its own message queue as soon as it calls one of a set of USER32 functions.

In your case CreateEvent + (Re)SetEvent + WaitForMultipleObjects seems to be the best way, since I doubt you really want to put a message loop into the worker thread. Note also that you might need two events: one for controlling the work being done, the other for thread termination.


vivik wrote:
By the way, there is some odd warning against WaitForSingleObject, something about deadlock if you use COM and multiple threads and something, I'm not sure if this applies to me.

https://marc.durdin.net/2012/08/waitforsingleobject-why-you-should-never-use-it/


The problems described there are related to threads which need to handle window/thread messages and thus must have a message loop. If your worker thread is for some work that doesn’t use COM, you’re on the safe side since you’re then the one who controls the thread.
Post 14 Mar 2018, 20:54
vivik



Joined: 29 Oct 2016
Posts: 379

According to this http://www.bogotobogo.com/cplusplus/multithreading_win32A.php , there are 3 different ways to create a thread: CreateThread, _beginthread and _beginthreadex. The first one doesn't create thread-local storage for you, so this is the one I will use.

Thread-local storage is, um, something like a global variable, but its value is different for each thread. It's often used by some memory allocators, and when there are many threads. I will use only 2 threads for now, so I'll live without it. I'll just use a local variable instead of this thread-local global. Also looks like windows xp doesn't support something about TLS, and I want my programs to work even there.
Post 15 Mar 2018, 17:03
vivik



Joined: 29 Oct 2016
Posts: 379

There is [main thread], and there is [worker thread]

Simple case is this:

[main thread] --decompress this please--> [worker thread]
[main thread] <--ok, done-- [worker thread]




But most likely it will be something like this:

[main thread] --decompress this please--> [worker thread]
[main thread] --decompress this please--> [worker thread]
[main thread] --decompress this please--> [worker thread]
[main thread] <--ok, done-- [worker thread]
[main thread] --decompress this please--> [worker thread]
[main thread] <--ok, done-- [worker thread]
[main thread] <--ok, done-- [worker thread]
[main thread] <--ok, done-- [worker thread]
[main thread] --decompress this please--> [worker thread]
[main thread] <--ok, done-- [worker thread]



I guess I will make the message exchange through the ring of, let's say, 0x20 dwords. First dword is "alive" flag, second dword is the meaningful data, and it goes in circle.

The --decompress this please--> will look like this:

Code:
i=0
li=0
dword ring[0x20]={0}



Code:
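ring[i+1]="C:/loadmepls.jpg"   # write the payload first
MemoryFence                    # payload must be visible before the flag
ring[i]=1                      # mark the slot as pending
i=(i+2)%0x20                   # advance and wrap around the ring



The worker thread, which ends each job with <--ok, done--, will look like this:

Code:
wi=0                           # the worker's own index, separate from the main thread's i
while true:
    if ring[wi]==0:
        sleep or some other winapi function that will pause the thread
    else:
        MemoryFence            # read the flag before the payload
        path = ring[wi+1]
        decompressed_data = decompress(path)
        ring[wi+1] = decompressed_data
        MemoryFence            # result must be visible before the flag is cleared
        ring[wi]=0
        wi=(wi+2)%0x20



The waiting for "ok, done":

Code:
if li != i and ring[li]==0:    # a request is outstanding and the worker has cleared its flag
    MemoryFence                # read the flag before the result
    decompressed_data = ring[li+1]
    load_to_gpu(decompressed_data)
    li=(li+2)%0x20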






I guess I will use CreateEvent WaitForSingleObject for --decompress this please--> (so that it doesn't waste cpu when there is no need for this), and that ring for <--ok, done--

Using the per thread message loop will simplify things and my life, but I'm not sure if it's the right thing to do. This ring looks more lightweight than the usual message loop.

Also, if I malloc'd some memory in [main thread], I shouldn't free it in the [worker thread], and vice versa? That "C:/loadmepls.jpg" is being overwritten without getting freed. Should I use 2 or 3 rings instead, and add a --uploaded to gpu, now free memory--> message? The os provided heaps are probably thread safe, but I'm planning on replacing them with my own... With a custom memory allocator, more tuned for this program. Something like in golang, but more dumb.

Eh, just make it work first, make it good afterwards. This looks like a good opportunity to learn.
Post 15 Mar 2018, 17:34
DimonSoft



Joined: 03 Mar 2010
Posts: 231
Location: Belarus


vivik wrote:
According to this http://www.bogotobogo.com/cplusplus/multithreading_win32A.php , there are 3 different ways to create a thread: CreateThread, _beginthread and _beginthreadex. The first one doesn't create thread-local storage for you, so this is the one I will use.

Thread-local storage is, um, something like a global variable, but its value is different for each thread. It's often used by some memory allocators, and when there are many threads. I will use only 2 threads for now, so I'll live without it. I'll just use a local variable instead of this thread-local global. Also looks like windows xp doesn't support something about TLS, and I want my programs to work even there.


It’s not about TLS. TLS is implemented as a separate set of functions in Windows API. It’s not CreateThread that doesn’t create a TLS, it’s _beginthread[ex] that (might) do.

If memory serves me (I haven’t used C/C++ for ages), _beginthreadex is the way to go if you expect the CRT to work correctly on the thread. Since this includes memory management which is done automatically in some places in C++, you’ll almost always want to use _beginthreadex.

Also a recent article by Raymond Chen about the cons of _beginthread: If I call GetExitCodeThread for a thread that I know for sure has exited, why does it still say STILL_ACTIVE?
Post 15 Mar 2018, 17:46
vivik



Joined: 29 Oct 2016
Posts: 379

>If memory serves me (I haven’t used C/C++ for ages), _beginthreadex is the way to go if you expect the CRT to work correctly on the thread. Since this includes memory management which is done automatically in some places in C++, you’ll almost always want to use _beginthreadex.

Hm, yes, I will use CreateThread. I'm trying to get rid of crt anyway, I'm using C instead of C++ (kind of), and I'll use a custom memory allocator. Don't ask.
Post 15 Mar 2018, 20:07
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15727
Location: (514107) 2015 BZ509


vivik wrote:
Also, if I malloc'd some memory in [main thread], I shouldn't free it in the [worker thread], and vice versa?

I would suggest that each thread manage its own memory.

Have a clear policy on memory ownership. You can send pointers to allocated memory to another thread and also transfer ownership to that thread. Or if you expect to get results returned to you in the memory then you can pass a pointer to it and retain ownership. The last owner of each memory allocation is responsible to either free the memory or transfer ownership to another thread.
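One way to make such an ownership policy checkable is to tag each job with its current owner and assert on every handoff. The sketch below is only an illustration: job_t, owner_t, and the function names are hypothetical, the "pixels" buffer is a stand-in for decoded image data, and it assumes a thread-safe allocator such as the CRT's malloc, so it does not carry over directly to per-thread custom heaps. The owner tag itself is not synchronized. In a real program the handoff would ride on whatever queue or event passes the pointer between the threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef enum { OWNER_MAIN, OWNER_WORKER } owner_t;

typedef struct {
    owner_t        owner;        /* who may touch and must eventually free the job */
    char          *path;         /* allocated by the main thread   */
    unsigned char *pixels;       /* allocated by the worker thread */
    size_t         pixel_bytes;
} job_t;

/* Main thread: create a job that it owns. Returns NULL on allocation failure. */
job_t *job_create(const char *path) {
    job_t *j = calloc(1, sizeof *j);
    if (j == NULL) return NULL;
    size_t n = strlen(path) + 1;
    j->path = malloc(n);
    if (j->path == NULL) { free(j); return NULL; }
    memcpy(j->path, path, n);
    j->owner = OWNER_MAIN;
    return j;
}

/* Hand the job over. The previous owner must not touch it afterwards. */
void job_transfer(job_t *j, owner_t from, owner_t to) {
    assert(j->owner == from);
    j->owner = to;
}

/* Worker thread: attach a zeroed result buffer of the given size. */
bool job_set_result(job_t *j, size_t bytes) {
    assert(j->owner == OWNER_WORKER);
    j->pixels = calloc(bytes, 1);
    if (j->pixels == NULL) return false;
    j->pixel_bytes = bytes;
    return true;
}

/* The last owner frees everything that belongs to the job. */
void job_destroy(job_t *j, owner_t caller) {
    assert(j->owner == caller);
    free(j->pixels);
    free(j->path);
    free(j);
}
```

With this shape the path string is never overwritten in place: it travels with the job and is freed exactly once, by whoever owns the job last.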
Post 15 Mar 2018, 22:44
vivik



Joined: 29 Oct 2016
Posts: 379

Found more info on it, it's the simplest case of lock-free.

https://en.wikipedia.org/wiki/Non-blocking_algorithm


Quote:
a single-reader single-writer ring buffer FIFO, with a size which evenly divides the overflow of one of the available unsigned integer types, can unconditionally be implemented safely using only a memory barrier



https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem#Without_semaphores_or_monitors
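The quoted single-reader single-writer ring can be sketched in portable C11 as follows. This is a sketch under assumptions, not a tuned implementation: it uses C11 acquire/release atomics in place of an explicit MemoryBarrier() call, the capacity of 16 and all names (spsc_ring, ring_push, ring_pop) are made up, and the payload is a bare 32-bit value. The counters run freely and are reduced modulo the size only when indexing, which is why the size has to divide the unsigned overflow evenly (any power of two does).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 16u  /* hypothetical capacity */
static_assert((RING_SIZE & (RING_SIZE - 1u)) == 0,
              "RING_SIZE must be a power of two");

typedef struct {
    uint32_t    slots[RING_SIZE];
    atomic_uint head;  /* total items pushed, written only by the producer */
    atomic_uint tail;  /* total items popped, written only by the consumer */
} spsc_ring;

void ring_init(spsc_ring *r) {
    atomic_init(&r->head, 0u);
    atomic_init(&r->tail, 0u);
}

/* Producer only. Returns false when the ring is full. */
bool ring_push(spsc_ring *r, uint32_t value) {
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;
    r->slots[head % RING_SIZE] = value;
    /* Release: the slot write becomes visible before the new head. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer only. Returns false when the ring is empty. */
bool ring_pop(spsc_ring *r, uint32_t *out) {
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail % RING_SIZE];
    /* Release: the slot read finishes before the slot is handed back. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

Two such rings, one per direction, would cover both the "decompress this please" and the "ok, done" messages. A full or empty ring is reported to the caller instead of blocking, so pausing the thread is left to an event or similar.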

One thing I don't get yet, should I waste an entire cache line (64 bytes, probably) for each counter? This is to avoid "false sharing", which will probably happen no matter what. Some quotes:

https://stackoverflow.com/questions/16699247/what-is-cache-friendly-code


Quote:
false sharing

This occurs when each individual processor is attempting to use data in another memory region and attempts to store it in the same cache line. This causes the cache line -- which contains data another processor can use -- to be overwritten again and again.

Effectively, different threads make each other wait by inducing cache misses in this situation.

An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page) which require disk access.



https://stackoverflow.com/questions/14707803/line-size-of-l1-and-l2-caches


Quote:
Cache-Lines size is (typically) 64 bytes.



Um, for now I will place 6 such counters into the same 64-byte cache line, just in case.
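For comparison, the opposite choice, one cache line per counter, could look like the sketch below. It assumes a 64-byte line, which is typical on x86 but not guaranteed, and the struct and field names are hypothetical. The cost is 128 bytes for two counters. Whether that pays off depends on how often both threads actually write their counters.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size in bytes */

typedef struct {
    alignas(CACHE_LINE) atomic_uint head;  /* written by the producer */
    alignas(CACHE_LINE) atomic_uint tail;  /* written by the consumer */
} padded_counters;

/* Compile-time checks that the two counters cannot share a line. */
static_assert(offsetof(padded_counters, tail) -
              offsetof(padded_counters, head) >= CACHE_LINE,
              "head and tail must sit on different cache lines");
static_assert(sizeof(padded_counters) == 2 * CACHE_LINE,
              "each counter occupies one full line");
```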
Post 16 Mar 2018, 16:12
Copyright © 2004-2018, Tomasz Grysztar.