We will jump straight to the code. This innocent-looking little program has a major issue, at least when compiled as a release build with optimizations on my Mac using GCC, Apple's Clang, and LLVM, as well as on Windows using Visual Studio 2017, and run on a multicore machine. Can you spot the problem?
```cpp
#include <chrono>   // for the 100ms duration literal
#include <iostream>
#include <thread>

using namespace std;

int main(int argc, char** argv)
{
    bool flag = false;

    thread t1([&]() {
        this_thread::sleep_for(100ms);
        cout << "t1 started" << endl;
        flag = true;
        cout << "t1 signals and exits" << endl;
    });

    thread t2([&]() {
        cout << "t2 started" << endl;
        while(flag == false)
            ;
        cout << "t2 got signaled and exits" << endl;
    });

    t1.join();
    t2.join();

    return 1;
}
```
That's right! It will never terminate; it will hang forever! The while loop in t2 will never break. But why? Thread t1 sets flag to true, after all. Yes, but it does so too late (notice the 100ms sleep). By that point thread t2 has already cached flag in its core's L1 cache and will never see its updated value. If you think that making flag volatile will help, you're wrong. It may work on your compiler and machine, but there is no guarantee. Now what?
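To make the failure mode concrete, here is a sketch of what the optimizer is allowed to turn t2's loop into. This is an illustration of the transformation, not actual compiler output:

```cpp
#include <chrono>
#include <thread>

using namespace std;

int main()
{
    bool flag = false;

    thread t1([&]() {
        this_thread::sleep_for(100ms);
        flag = true;
    });

    thread t2([&]() {
        // What the optimizer may legally generate for `while(flag == false);`:
        // with no synchronization in sight it may assume flag is not modified
        // concurrently, load it once, and hoist the load out of the loop.
        bool cached = flag;      // single load of flag
        while (cached == false)
            ;                    // spins forever if that one load saw false
    });

    t1.join();
    t2.join();
}
```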
This was one of the hardest lessons in C++ and computer science for me. Before continuing to the fix section I highly recommend you read about the following: memory barriers, the C++ memory model (the C++ Memory Model series at Modernes C++ is a good start), and memory ordering. I'll see you in a couple of days 😉
The fix.
The simplest fix is to wrap access to flag in a mutex lock/unlock, or to make flag an atomic<bool> (both of those solutions insert the appropriate memory barriers). But that's not always an option for other data types…
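For instance, here is a minimal sketch of the atomic<bool> variant, keeping the same shape as the original example:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

using namespace std;

int main()
{
    atomic<bool> flag{false};   // loads/stores are seq_cst by default

    thread t1([&]() {
        this_thread::sleep_for(100ms);
        flag = true;            // atomic store: guaranteed to become visible
    });

    thread t2([&]() {
        while (!flag)           // atomic load on every iteration
            ;
    });

    t1.join();
    t2.join();
}
```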
We need to make sure that t2 can see the actions of t1 that happened later in time. For this we need to force cache synchronization between different CPU cores. We can do it in three ways:
1) By inserting memory barriers in the right places.
2) By inserting loads and stores of an atomic variable using release/acquire semantics (written out in a standalone sketch after this list).
3) By inserting loads and stores of a dependent atomic variable using release/consume semantics.
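For reference, here is way 2 written out without the macro plumbing of the listing below. This is a minimal sketch; the names g and payload are mine, not from the original:

```cpp
#include <atomic>
#include <iostream>
#include <thread>

using namespace std;

atomic<bool> g{false};
bool payload = false;

int main()
{
    thread t1([]() {                          // t1's role: producer
        payload = true;                       // plain write
        g.store(true, memory_order_release);  // publishes the write above it
    });

    thread t2([]() {                          // t2's role: consumer
        while (!g.load(memory_order_acquire)) // pairs with the release store
            ;
        cout << "payload = " << payload << endl; // guaranteed to print 1
    });

    t1.join();
    t2.join();
}
```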
Below is the corrected version of our example; uncomment each #define in turn to enable a different fix:
```cpp
#include <atomic>
#include <chrono>   // for the 100ms duration literal
#include <iostream>
#include <thread>

using namespace std;

// Uncomment exactly one to select a fix; leave all commented for the broken build.
//#define ATOMIC_FENCE
//#define ATOMIC_RELEASE
//#define ATOMIC_CONSUME

#if defined ATOMIC_FENCE
  // 1) stand-alone memory barriers
  #define FENCE_ACQUIRE atomic_thread_fence(memory_order_acquire)
  #define FENCE_RELEASE atomic_thread_fence(memory_order_release)
#elif defined ATOMIC_RELEASE
  // 2) release/acquire on an atomic variable
  atomic_bool f{false};
  #define FENCE_ACQUIRE f.load(memory_order_acquire)
  #define FENCE_RELEASE f.store(true, memory_order_release)
#elif defined ATOMIC_CONSUME
  // 3) release/consume on a dependent atomic variable
  atomic_bool f{false};
  #define FENCE_ACQUIRE f.load(memory_order_consume)
  #define FENCE_RELEASE f.store(flag, memory_order_release)
#else
  // no fix: both macros expand to nothing
  #define FENCE_ACQUIRE
  #define FENCE_RELEASE
#endif

int main(int argc, char** argv)
{
    bool flag = false;

    thread t1([&]() {
        this_thread::sleep_for(100ms);
        cout << "t1 started" << endl;
        flag = true;
        FENCE_RELEASE;
        cout << "t1 signals and exits" << endl;
    });

    thread t2([&]() {
        cout << "t2 started" << endl;
        while(flag == false)
            FENCE_ACQUIRE;
        cout << "t2 got signaled and exits" << endl;
    });

    t1.join();
    t2.join();

    return 1;
}
```
I don’t believe this issue has to do with hardware caching. In fact, this is an example of the one time that volatile is the correct keyword: the issue isn’t that the CPU has cached the data, it’s that the compiler only issues a single load for the data. You can verify this by taking a quick look at the assembly with and without the volatile keyword. The volatile keyword (which is used as part of the underlying implementation of std::atomic) tells the compiler that the variable could change due to outside forces, forcing it to reload on every read. The CPU then ensures that each load is consistent across cores.
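A quick way to see the single-load behavior this comment describes is to compare the generated assembly for the two loops below at -O2, e.g. on Compiler Explorer. This is a hypothetical snippet of mine, not from the original post:

```cpp
bool plain_flag = false;
volatile bool volatile_flag = false;

void spin_plain()
{
    while (!plain_flag)     // compiler may load once and hoist the load;
        ;                   // this often compiles to an unconditional loop
}

void spin_volatile()
{
    while (!volatile_flag)  // volatile forces a fresh load per iteration
        ;
}
```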
For an example of where you need memory fences, you would need another variable. E.g., try adding another bool "value" which you initialize to false and then set to true just before setting flag in t1. Then read it after escaping the loop in t2, while only using volatile on flag instead of the atomic operations or fences. If you set things up just right (ensuring that flag and value are on different cache lines) and you hit an unlucky timing, you may see value print as false on the second thread. Fixing that requires one of the methods you describe.
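A sketch of that experiment follows; the alignment trick and the exact setup are my assumptions about what the comment describes:

```cpp
#include <chrono>
#include <iostream>
#include <thread>

using namespace std;

volatile bool flag = false;      // re-loaded every iteration, but orders nothing
alignas(64) bool value = false;  // try to keep it on a different cache line

int main()
{
    thread t1([]() {
        this_thread::sleep_for(100ms);
        value = true;   // plain store: may be reordered past the flag store
        flag = true;
    });

    thread t2([]() {
        while (!flag)
            ;
        // Without a release/acquire pair, this can (on unlucky timing and
        // the right hardware) print value = 0.
        cout << "value = " << value << endl;
    });

    t1.join();
    t2.join();
}
```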
Hope that helps some. It’s definitely tricky stuff.
Have a look at the articles at https://preshing.com/20120913/acquire-and-release-semantics/; they are excellent.
I agree with Falcon that this has nothing to do with caching inside the cache hierarchy. It is just that a thread may generally assume that data it reads is not changed by another thread, unless synchronization mechanisms are used.
OTOH, volatile is never the right answer when it comes to multithreading, at least when following the C++ standard; in practice it will often work, especially on x86, which is very forgiving WRT memory consistency.
I think I have to agree. The more I think about it, the more I realize this is a function of compiler optimization and not cache consistency. What I don't understand, then, is why the presence of atomics or fences appears to fix the bad behavior…
Because this is their purpose.
Locking or unlocking a mutex forms a synchronization point: a defined point in the execution stream where 1) previous changes must actually be written to memory, and 2) data copies held in registers may have become stale because their memory location was altered by other threads.
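As a sketch of those synchronization points applied to the original example (the function names are mine):

```cpp
#include <mutex>

using namespace std;

mutex m;
bool flag = false;   // protected by m

void signal_ready()  // t1's role
{
    lock_guard<mutex> lk(m);
    flag = true;     // the write is published when lk releases m
}

bool poll_ready()    // called from t2's spin loop
{
    lock_guard<mutex> lk(m);
    return flag;     // acquiring m makes t1's write visible
}
```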
C++’s atomics form such global synchronization points, too.
Could you include a Godbolt link for the reproducer?
I believe your Mac is not an x86 machine but AArch64, because I haven't been able to reproduce this on my Intel machine.