Memory barriers and thread synchronization

We will jump straight to the code. This innocent looking little program has a major issue (when compiled for release build with optimizations on my Mac using GCC, Apple’s CLANG, and LLVM, as well as on Windows using Visual Studio 2017, and ran on a multicore machine). Can you spot the problem?

#include 
#include 
using namespace std;

int main(int argc, char** argv)
{
	bool flag = false;
	
	thread t1([&]() {
		this_thread::sleep_for(100ms);
		cout << "t1 started" << endl;
		flag = true;
		cout << "t1 signals and exits" << endl;
	});
	
	thread t2([&]() {
		cout << "t2 started" << endl;
		while(flag == false) ;
		cout << "t2 got signaled and exits" << endl;
	});
	
	t1.join();
	t2.join();
	
	return 1;
}

That's right! It will never terminate! It will hang forever! The

while

loop in line 18 will never break. But why? Thread

t1

sets

flag

true

after all. Yes, but it does so too late (notice the 100ms sleep). At that point thread

t2

has already L1 cached

flag

and will never see its updated value. If you think that making

flag

volatile

will help you're wrong. It may work on your compiler/machine but it is no guarantee. Now what?

This was one of the hardest lessons in C++ and computer science for me. Before continuing to the fix section I highly recommend you read about the following: memory barriers, C++ memory model as well as C++ Memory Model at Modernest C++, and memory ordering. I'll see you in a couple of days 😉

The fix.

The simplest fix is to wrap access to flag around a

mutex

lock

unlock

or make

flag

atomic<bool>;

(both of those solutions will insert appropriate memory barriers). But that's not always an option for other data types...

We need to make sure that

t2

can see the actions of

t1

that happened later in time. For this we need to force cache synchronization between different CPU cores. We can do it in three ways:
1) By inserting memory barriers in the right places.
2) By inserting loads and stores of an atomic variable using release/acquire semantics.
3) By inserting loads and stores of a dependent atomic variable using release/consume semantics.

Below is the corrected version of our example; uncomment each

#define

to engage different fixes:

#include 
#include 
#include 
using namespace std;

//#define ATOMIC_FENCE
//#define ATOMIC_RELEASE
//#define ATOMIC_CONSUME

#if defined ATOMIC_FENCE
#define FENCE_ACQUIRE atomic_thread_fence(memory_order_acquire)
#define FENCE_RELEASE atomic_thread_fence(memory_order_release)
#elif defined ATOMIC_RELEASE
atomic_bool f{false};
#define FENCE_ACQUIRE f.load(memory_order_acquire)
#define FENCE_RELEASE f.store(true, memory_order_release)
#elif defined ATOMIC_CONSUME
atomic_bool f{false};
#define FENCE_ACQUIRE f.load(memory_order_consume)
#define FENCE_RELEASE f.store(flag, memory_order_release)
#else
#define FENCE_ACQUIRE
#define FENCE_RELEASE
#endif

int main(int argc, char** argv)
{
	bool flag = false;

	thread t1([&]() {
		this_thread::sleep_for(100ms);
		cout << "t1 started" << endl;
		flag = true;
		FENCE_RELEASE;
		cout << "t1 signals and exits" << endl;
	});

	thread t2([&]() {
		cout << "t2 started" << endl;
		while(flag == false) FENCE_ACQUIRE;
		cout << "t2 got signaled and exits" << endl;
	});

	t1.join();
	t2.join();

	return 1;
}

6 Replies to “Memory barriers and thread synchronization”

I don’t believe this issue has to do with hardware caching. In fact, this is an example of the one time that volatile is the correct keyword: the issue isn’t that the CPU has cached the data, it’s that the compiler only issues a single load for the data. You can verify this by taking a quick look at the assembly with and without the volatile keyword. The volatile keyword (which is used as part of the underlying implementation of std::atomic) tells the compiler that the variable could change due to outside forces, forcing it to reload on every read. The CPU then ensures that each load is consistent across cores.

For an example of where you need memory fences, you would need another variable. E.g., try adding another bool “value” which you initialize to false and then set to true before setting the flag in t1. Then read it after escaping the loop in t2, but just using volatile on the flag instead of using the atomic operations or fences. If you set things up just right (ensuring that flag and value are on different cache lines) and you happen on an unlucky timing, you may see value print as false on the second thread. Fixing that requires one of the methods you describe.

Hope that helps some. It’s definitely tricky stuff.

Have a look at the articles here: https://preshing.com/20120913/acquire-and-release-semantics/ they are excellent.

I agree with Falcon that this has nothing to do with caching inside of the cache hierarchy. It is just that a thread may generally assume that data it reads is not changed by another thread – unless synchronization mechanisms are used.

OTOH, volatile is never the right answer when it comes to multithreading (at least when following the C++ standard, in practice it will work, especially on x86 which is very forgiving WRT memory consistency).

Martin Vorbrodt says:

February 23, 2019 at 3:41 pm

I think I have to agree. The more I think about it the more I realize this is a function of compiler optimization and not cache consistency. What I don’t understand then is why the presence of atomics or fences appears to fix the bad behavior…

Loading...

Reply
1. Markus says:
  
  February 23, 2019 at 7:39 pm
  
  Because this is their purpose.
  
  Locking or unlocking mutexes form synchronization points which defined points in the execution stream where 1) previous changes must actually be written to memory and 2) data copies in registers may have become stale as their memory location was altered by other threads.
  
  C++’s atomics form such global synchronization points, too.
  
  Loading...
  
  Reply

Could you include Godbolt link for the reproducer?

I believe your Mac is not x86 architecture but AArch64, because I haven’t been able to reproduce it on my Intel machine

Falcon says:

February 23, 2019 at 2:11 am

I don’t believe this issue has to do with hardware caching. In fact, this is an example of the one time that volatile is the correct keyword: the issue isn’t that the CPU has cached the data, it’s that the compiler only issues a single load for the data. You can verify this by taking a quick look at the assembly with and without the volatile keyword. The volatile keyword (which is used as part of the underlying implementation of std::atomic) tells the compiler that the variable could change due to outside forces, forcing it to reload on every read. The CPU then ensures that each load is consistent across cores.

For an example of where you need memory fences, you would need another variable. E.g., try adding another bool “value” which you initialize to false and then set to true before setting the flag in t1. Then read it after escaping the loop in t2, but just using volatile on the flag instead of using the atomic operations or fences. If you set things up just right (ensuring that flag and value are on different cache lines) and you happen on an unlucky timing, you may see value print as false on the second thread. Fixing that requires one of the methods you describe.

Hope that helps some. It’s definitely tricky stuff.

Have a look at the articles here: https://preshing.com/20120913/acquire-and-release-semantics/ they are excellent.

Loading...

Markus says:

February 23, 2019 at 3:38 pm

I agree with Falcon that this has nothing to do with caching inside of the cache hierarchy. It is just that a thread may generally assume that data it reads is not changed by another thread – unless synchronization mechanisms are used.

OTOH, volatile is never the right answer when it comes to multithreading (at least when following the C++ standard, in practice it will work, especially on x86 which is very forgiving WRT memory consistency).

Loading...

1. Martin Vorbrodt says:
  
  February 23, 2019 at 3:41 pm
  
  I think I have to agree. The more I think about it the more I realize this is a function of compiler optimization and not cache consistency. What I don’t understand then is why the presence of atomics or fences appears to fix the bad behavior…
  
  Loading...
  
  1. Markus says:
    
    February 23, 2019 at 7:39 pm
    
    Because this is their purpose.
    
    Locking or unlocking mutexes form synchronization points which defined points in the execution stream where 1) previous changes must actually be written to memory and 2) data copies in registers may have become stale as their memory location was altered by other threads.
    
    C++’s atomics form such global synchronization points, too.
    
    Loading...
    
Vinh says:

February 20, 2023 at 6:16 am

Could you include Godbolt link for the reproducer?

Loading...

Anonymous says:

February 25, 2024 at 5:02 am

I believe your Mac is not x86 architecture but AArch64, because I haven’t been able to reproduce it on my Intel machine

Loading...

Practical Modern C++

Memory barriers and thread synchronization

Like this:

Related

6 Replies to “Memory barriers and thread synchronization”

Leave a ReplyCancel reply

Practical Modern C++

Memory barriers and thread synchronization

Share this:

Like this:

Related

6 Replies to “Memory barriers and thread synchronization”

Leave a ReplyCancel reply