Below is my implementation of the thread pool described in this talk, along with a benchmark comparing it against my simple thread pool implementation. The advanced pool is 15x faster at scheduling and dispatching short, random-length work items on my 2018 MacBook Pro with an i5 CPU and 4 logical cores. It uses a queue per worker thread and a work-stealing dispatcher: it tries to enqueue each work item onto a queue that is not currently locked by a dispatch thread, and workers try to steal work from other unblocked queues. As always, the complete implementation is available on GitHub.

Benchmark program:

* Apple Clang -Ofast -march=native -std=c++17 -lc++

simple_thread_pool duration = 30.337 s
thread_pool duration = 1.625 s

* LLVM -Ofast -march=native -std=c++17 -lc++

simple_thread_pool duration = 25.785 s
thread_pool duration = 1.615 s

* G++ -Ofast -march=native -std=c++17 -lstdc++

simple_thread_pool duration = 26.28 s
thread_pool duration = 1.614 s

Program output.

thread_pool class:
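(The class itself is embedded from GitHub. As a rough, compilable sketch of the structure described above, one notification queue per worker, non-blocking try_push/try_pop for work stealing, and a blocking fallback; this is my own illustration with my own names, not the exact published code:)

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <queue>
#include <mutex>
#include <thread>
#include <vector>

// One of these per worker thread; try_push/try_pop back off instead of
// blocking, so callers can move on to a less contended queue.
class notification_queue {
    std::queue<std::function<void()>> m_queue;
    std::mutex m_mutex;
    std::condition_variable m_ready;
    bool m_done = false;
public:
    bool try_push(const std::function<void()>& f) {
        {
            std::unique_lock<std::mutex> lock(m_mutex, std::try_to_lock);
            if (!lock) return false;          // queue busy: caller tries the next one
            m_queue.push(f);
        }
        m_ready.notify_one();
        return true;
    }
    void push(std::function<void()> f) {      // blocking fallback
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(f));
        }
        m_ready.notify_one();
    }
    bool try_pop(std::function<void()>& f) {
        std::unique_lock<std::mutex> lock(m_mutex, std::try_to_lock);
        if (!lock || m_queue.empty()) return false;
        f = std::move(m_queue.front());
        m_queue.pop();
        return true;
    }
    bool pop(std::function<void()>& f) {      // blocks; false only when done and drained
        std::unique_lock<std::mutex> lock(m_mutex);
        m_ready.wait(lock, [this] { return !m_queue.empty() || m_done; });
        if (m_queue.empty()) return false;
        f = std::move(m_queue.front());
        m_queue.pop();
        return true;
    }
    void done() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_done = true;
        }
        m_ready.notify_all();
    }
};

class thread_pool {
    static constexpr unsigned K = 2;          // rounds of try_push before blocking
    unsigned m_count;
    std::vector<notification_queue> m_queues;
    std::vector<std::thread> m_threads;
    std::atomic<unsigned> m_index{0};

    void run(unsigned i) {
        for (;;) {
            std::function<void()> f;
            for (unsigned n = 0; n != m_count; ++n)    // steal from any unlocked queue
                if (m_queues[(i + n) % m_count].try_pop(f)) break;
            if (!f && !m_queues[i].pop(f)) return;     // block on own queue; exit on done
            f();
        }
    }
public:
    explicit thread_pool(unsigned count = std::thread::hardware_concurrency())
        : m_count(count), m_queues(count) {
        for (unsigned i = 0; i != m_count; ++i)
            m_threads.emplace_back([this, i] { run(i); });
    }
    ~thread_pool() {
        for (auto& q : m_queues) q.done();
        for (auto& t : m_threads) t.join();
    }
    template<typename F>
    void enqueue_work(F&& f) {
        std::function<void()> work = std::forward<F>(f);
        unsigned i = m_index++;
        for (unsigned n = 0; n != m_count * K; ++n)    // prefer an uncontended queue
            if (m_queues[(i + n) % m_count].try_push(work)) return;
        m_queues[i % m_count].push(std::move(work));   // all busy: block on home queue
    }
};
```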

14 Replies to “Advanced thread pool”

  1. Thanks for writing this. Just built your pool from git:

    both are the same speed for me?

    simple_thread_pool 100 1 38.2594 s
    379.447 ms 377.661 ms 382.583 ms
    11.8086 ms 7.64767 ms 17.8581 ms

    thread_pool 100 1 38.7423 s
    387.369 ms 386.488 ms 387.932 ms
    3.53217 ms 2.48396 ms 4.95731 ms

    That’s just running your built-in tests.

    Is it that the “work” is too predictable to benefit from task stealing etc.?

      1. I checked with verbose and it was all compiled with -O3. Lots of random tiny tasks are tricky to schedule, that’s true. So I played with REPS etc. but couldn’t create a difference.

        I had to add a PREFER PTHREADS setting to the CMake list to make it compile on my Ubuntu 19.04 system.

        Btw why is there no “official” implementation for this sort of thing? Something in the STL or at least Boost or some other well maintained library.

        std::async is totally broken, on Linux at least. It either runs with “deferred” (which is not parallel at all) or launches one thread per task and blows up. So a thread pool is totally needed. Am I missing something?

        Lots of talk about “better futures” (e.g. the Sean Parent futures library, or the new stuff in C++20), but where is the “standard thread pool implementation”?
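        (The launch-policy behavior described above is easy to demonstrate, and is actually guaranteed by the standard: std::launch::deferred runs the task lazily in whatever thread calls get(), with no parallelism at all, while std::launch::async runs it as if on a new thread. A minimal check:)

```cpp
#include <future>
#include <thread>

// std::launch::deferred: the task runs inline in the thread that calls get().
bool deferred_runs_inline() {
    auto fut = std::async(std::launch::deferred,
                          [] { return std::this_thread::get_id(); });
    return fut.get() == std::this_thread::get_id();   // true: same thread
}

// std::launch::async: the task runs as if on a newly created thread,
// one per call -- which is the "blows up" scaling problem.
bool async_runs_on_new_thread() {
    auto fut = std::async(std::launch::async,
                          [] { return std::this_thread::get_id(); });
    return fut.get() != std::this_thread::get_id();   // true: different thread
}
```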

      2. This guy compared a bunch of random GitHub projects:

        https://github.com/Fdhvdu/ThreadPool/tree/master/comparison

        And this one won (his own one). And it’s modern code, etc.:

        https://github.com/Fdhvdu/ThreadPool

        But it is not at all clear that it takes the Sean Parent talk’s concepts into account (as yours does):

        https://github.com/Fdhvdu/ThreadPool/blob/master/README.md
        “Future work

        add a non-block version of CThreadPool::add
        work stealing”

        ???

        It’s not that (!) complicated, but no “formal” solutions? Everyone hacking around on their own?

  2. Nice TP,
    Funny, I have never seen this but ended up producing a very similar pool, likely because we both drew inspiration from Parent’s talk. The only difference is the way I chose to handle the task: I make a task that stores the arguments in a tuple and uses a polymorphic invoke, so I can store pointers to the base class, allowing me to store functions with varying arguments. I found it to perform better than a solution similar to the one above, which I tried first.

    It is still under construction because I am working on making it a parent-stealing queue. The one you have above fails when you attempt to make recursive calls to the TP. Try a merge sort to see what I mean.

    https://github.com/xSeditx/Creature-Engine/blob/master/CreatureEngine/Core/Threading/Threadpool.h
    https://github.com/xSeditx/Creature-Engine/blob/master/CreatureEngine/Core/Threading/Threadpool.cpp
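    (A minimal sketch of the task design described above; the names here are mine, not taken from the linked engine. Arguments are stored in a std::tuple and invoked through a virtual call, so the pool can hold base-class pointers to tasks whose callables take different argument lists:)

```cpp
#include <memory>
#include <tuple>
#include <utility>

// Type-erased base: the pool only ever sees this interface.
struct task_base {
    virtual ~task_base() = default;
    virtual void invoke() = 0;
};

// Concrete task: callable plus its arguments, captured in a tuple.
template<typename F, typename... Args>
struct task : task_base {
    F func;
    std::tuple<Args...> args;
    task(F f, Args... a) : func(std::move(f)), args(std::move(a)...) {}
    void invoke() override { std::apply(func, args); }  // unpack and call
};

// Factory: deduces F/Args and returns a base-class pointer the pool can store.
template<typename F, typename... Args>
std::unique_ptr<task_base> make_task(F f, Args... a) {
    return std::make_unique<task<F, Args...>>(std::move(f), std::move(a)...);
}
```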

  4. I didn’t see much difference between the two thread pools. Although it was a fun way to kill an evening. 🙂

    AMD Phenom(tm) II X4 965 Processor
    4 Cores
    Arch Linux

    simple_thread_pool duration = 35.443 s
    thread_pool duration = 32.128 s

    simple (100,000 tasks, 100 reps) 1164.47 ms (10,000,000)
    advanced (100,000 tasks, 100 reps) 823.953 ms (10,000,000)

    simple (100,000 tasks, 200 reps) 2015.13 ms (20,000,000)
    advanced (100,000 tasks, 200 reps) 2469.78 ms (20,000,000)

    simple (100,000 tasks, 300 reps) 3775.12 ms (30,000,000)
    advanced (100,000 tasks, 300 reps) 3748.74 ms (30,000,000)

    simple (100,000 tasks, 400 reps) 4824.86 ms (40,000,000)
    advanced (100,000 tasks, 400 reps) 4554.73 ms (40,000,000)

    simple (100,000 tasks, 500 reps) 5721.67 ms (50,000,000)
    advanced (100,000 tasks, 500 reps) 5684.69 ms (50,000,000)

    simple (100,000 tasks, 600 reps) 6955.63 ms (60,000,000)
    advanced (100,000 tasks, 600 reps) 6838.74 ms (60,000,000)

    simple (100,000 tasks, 700 reps) 7943.49 ms (70,000,000)
    advanced (100,000 tasks, 700 reps) 7716.07 ms (70,000,000)

    simple (100,000 tasks, 800 reps) 8803.46 ms (80,000,000)
    advanced (100,000 tasks, 800 reps) 9205.39 ms (80,000,000)

    simple (100,000 tasks, 900 reps) 10698.1 ms (90,000,000)
    advanced (100,000 tasks, 900 reps) 10614.8 ms (90,000,000)

    simple (100,000 tasks, 1,000 reps) 11327.7 ms (100,000,000)
    advanced (100,000 tasks, 1,000 reps) 11496 ms (100,000,000)

    I ran the test longer than this but the results were pretty consistent.

    1. You are correct in the sense that as the duration of each individual task goes up, the performance difference between the pools shrinks. Try tasks with 1 to 10 reps instead; you may find bigger performance gains from the advanced one.

  6. Hi Martin, (re-posting as non-anonymous)

    Why do you prefer the following signature, as opposed to a simpler signature not including f’s arguments (they can be bound to the ‘work’ function externally)? This signature suggests perfect forwarding, but I don’t think that really happens internally. Am I right?

    template<typename F, typename... Args>
    void enqueue_work(F&& f, Args&&... args)
    {
        auto work = [f, args...]() { f(args...); };
        unsigned int i = m_index++;
        for(unsigned int n = 0; n < m_count * K; ++n)
            if(m_queues[(i + n) % m_count].try_push(work)) return;
        m_queues[i % m_count].push(work);
    }

    1. I prefer this signature because it allows both: binding arguments to a function in place and externally. The proc and args are forwarded correctly (moves), but then a copy has to be made when they are put on a queue, unfortunately. I could solve it with shared pointers, I suppose…
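    (A toy illustration of this point; the executor here runs work inline instead of queueing it, purely to keep the example self-contained. The two-parameter signature accepts both in-place and external argument binding:)

```cpp
#include <functional>
#include <tuple>
#include <utility>

// Stand-in for the pool: same enqueue_work signature, but runs work inline.
struct inline_executor {
    template<typename F, typename... Args>
    void enqueue_work(F&& f, Args&&... args) {
        // f and args are forwarded (moved) into the capture; a real pool
        // would still copy `work` when putting it on a queue, as noted above.
        auto work = [f = std::forward<F>(f),
                     t = std::make_tuple(std::forward<Args>(args)...)]() mutable {
            std::apply(f, t);
        };
        work();  // a real pool would try_push/push `work` here instead
    }
};
```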
