L1 cache lines

Herb Sutter gave an interesting talk titled Machine Architecture: Things Your Programming Language Never Told You:

In it he gives a tiny yet very interesting example of code that illustrates hardware destructive interference: how the size of L1 cache line and improper data layout can negatively affect performance of your code.
The example program allocates two int’s on the heap one right next to the other. It then starts two threads; each thread reads and writes to one of the int’s location. Let’s do just that, 100’000’000 times and see how long it takes:

Duration: 4338.55 ms

Let us now do the same exact thing, except this time we’ll space the int’s apart, by… you guessed it, L1 cache line size (64 bytes on Intel and AMD x64 chips):

Duration: 1219.50 ms

The same code now runs 4 times faster. What happens is that the L1 caches of the CPU cores no longer have to be synchronized every time we write to a memory location.

Lesson here is that data layout in memory matters. If you must run multiple threads that perform work and write to memory locations, make sure those memory locations are separated by L1 cache line size. C++17 helps us with that: Hardware Constructive and Destructive Interference Size.

Complete listing:

#include <iostream>
#include <thread>
#include <chrono>
#include <cstdlib>

using namespace std;
using namespace chrono;

const int CACHE_LINE_SIZE = 64;//sizeof(int);
const int SIZE = CACHE_LINE_SIZE / sizeof(int) + 1;
const int COUNT = 100'000'000;

int main(int argc, char** argv)
{
    srand((unsigned int)time(NULL));

    int* p = new int [SIZE];

    auto proc = [](int* data) {
        for(int i = 0; i < COUNT; ++i)
            *data = *data + rand();
    };
    
    auto start_time = high_resolution_clock::now();
    
    std::thread t1(proc, &p[0]);
    std::thread t2(proc, &p[SIZE - 1]);
    
    t1.join();
    t2.join();
    
    auto end_time = high_resolution_clock::now();
    cout << "Duration: " << duration_cast<microseconds>(end_time - start_time).count() / 1000.f << " ms" << endl;

    return 1;
}

#include <iostream>

#include <thread>

#include <chrono>

#include <cstdlib>

using namespace std;

using namespace chrono;

const int CACHE_LINE_SIZE = 64;//sizeof(int);

const int SIZE = CACHE_LINE_SIZE / sizeof(int) + 1;

const int COUNT = 100'000'000;

int main(int argc, char** argv)

{

srand((unsigned int)time(NULL));

int* p = new int [SIZE];

auto proc = [](int* data) {

for(int i = 0; i < COUNT; ++i)

*data = *data + rand();

};

auto start_time = high_resolution_clock::now();

std::thread t1(proc, &p[0]);

std::thread t2(proc, &p[SIZE - 1]);

t1.join();

t2.join();

auto end_time = high_resolution_clock::now();

cout << "Duration: " << duration_cast<microseconds>(end_time - start_time).count() / 1000.f << " ms" << endl;

return 1;

}

6 Replies to “L1 cache lines”

Draghi Puterity says:

February 2, 2019 at 4:47 pm

I can reproduce your numbers in debug mode, but in release mode the difference is much smaller.

Debug
~270 ms – CACHE_LINE_SIZE = 64;
~630 ms – CACHE_LINE_SIZE = sizeof(int);

Release
~240 ms – CACHE_LINE_SIZE = 64;
~270 ms – CACHE_LINE_SIZE = sizeof(int);

Compiled on a Lenovo with VS2017 on an Windows 7 64 bit i7-3720QM @ 2.6 GHz

Any idea why?

Loading...

1. Martin Vorbrodt says:
  
  February 2, 2019 at 5:07 pm
  
  Yes. In release the loop got optimized away. I will update the listing to work better in release mode.
  
  Loading...
  
Pingback: Memory barriers and thread synchronization – Vorbrodt's C++ Blog
degski says:

April 5, 2019 at 11:34 pm

Doesn’t alignas ( 64 ) int a, b; do the job [instead of the array]?

Loading...

1. Martin Vorbrodt says:
  
  April 6, 2019 at 7:42 am
  
  Yes it does! I’ll make a post about it.
  
  Loading...
  
Pingback: Data alignment the C++ way – Vorbrodt's C++ Blog

Practical Modern C++

L1 cache lines

Like this:

Related

6 Replies to “L1 cache lines”

Leave a ReplyCancel reply

Practical Modern C++

L1 cache lines

Share this:

Like this:

Related

6 Replies to “L1 cache lines”

Leave a ReplyCancel reply