Based on a true story from the San Diego C++ Meetup #46 1/2023

After our Jan 2023 virtual meeting I wanted to summarize all the cool stuff we learned during the session. This is the short version of the talk that I (Kobi) presented on Tuesday, Jan 17 2023.

First, I’d like to give a few shout-outs to the people who inspired this talk.

The first is Pablo Arias. Pablo wrote a three-part blog series that explains how to combine C++20 Coroutines with the Linux kernel’s io_uring feature. The first part can be found here. It includes links to the source and to the discussion on reddit. The series is written so well that it helped me finally understand how Coroutines work, and I wanted to present it to the members of the San Diego C++ Meetup.

Next, I’d like to thank Rainer Grimm from the modernescpp blog. Rainer has written many books that I have mentioned during the San Diego C++ sessions, he has an Educative.io class, and he is a well-known speaker at many C++ conferences. His C++20 Coroutines blog posts are very helpful.

Coroutines change the way you write parallel and multi-threaded code, for the better IMO. Once the topic is well understood, you will want to use it in your code base, as it simplifies many constructs.

I’d like to touch upon a few main issues when it comes to Coroutines. Why is this so hard?

  1. First, it is a complicated domain. Async, multi-threading and similar topics are usually harder to reason about.
  2. There are new keywords that we need to learn and understand how to use.
  3. There is a new Coroutines API. How do we use it? What do we need to implement? Do we need to inherit? What is the new contract between the different pieces of the app when it comes to Coroutines?
  4. Compiler magic. The compiler does a lot of things under the hood and we need to be aware of them. We usually know that the compiler might generate special functions for us (e.g. a copy constructor), but Coroutines are a new level of “compiler magic”.
  5. When a Coroutine function is running, what is the thread context? What is the thread context of the “Awaitables”? What happens when we resume? Which thread is being used? These are real questions that we need to answer to fully understand this domain. The execution flow is very different from what we are used to, even when threads are involved.
  6. C++20 gave us “the assembly of Coroutines”, as I like to call it. During the meetup session I mentioned that in his latest book, Bjarne recommends using libraries instead of writing your own Coroutines boilerplate code. During the talk, I used the low-level Coroutines API.

I presented the following from modernescpp:

This is a good image illustrating the new flow of function calls.

There are numerous blog posts out there that explain what a Coroutine function/object/frame is. For our purpose in this post, I’ll touch upon a few elements but will not cover everything under the sun when it comes to Coroutines. The goal is to give readers basic++ knowledge that jump-starts them to work on more complicated constructs.

What qualifies a function as a coroutine?

It uses at least one of the following keywords:

  • co_await – suspend and resume execution of the function while keeping the context (think stack) intact!
  • co_return – return results from Coroutine frame/function.
  • co_yield – supports generator functions.

The Coroutines proposal N4402 describes the design goals.

The C++20 Coroutines framework includes many functions to support various use cases and it is highly customizable. The main parts are:

  • Coroutine handle – this is your handle to the Coroutine context.
  • promise object – think about it as a way to communicate results back to a waiter.
  • Coroutine frame – holds the state of the Coroutine function.

Task and Promise

The promise provides a result or an exception. It is the result of the work done in the Coroutine function.

Pseudocode for a possible promise struct, residing inside a “Task”, looks like the following:

The promise is highly customizable through the following functions:

  • initial_suspend / final_suspend – control whether the Coroutine should be suspended before it runs or before it ends.
  • unhandled_exception – invoked when an uncaught exception escapes the Coroutine body.
  • get_return_object – constructs and returns the Coroutine object itself; many name it “Task”.
  • return_value – this is “magically” invoked by compiler-generated code when we write something like co_return val;
  • yield_value – related to generators; invoked when we write co_yield val;

get_return_object(), mentioned above, is super important. It creates the “Task” instance. The promise type is nested inside the Task.

The promise communicates the results (or void) using return_value/return_void/yield_value.

Awaitable

Awaitables are structs/classes that we, the authors of the code/framework, write.

An Awaitable is a type whose instances you can co_await.

So what would a bird’s-eye view of the Coroutine structure in your program look like?

Here is pseudocode of a Task with a nested promise, an awaitable with operator co_await() and finally, a Coroutine function named “func”. This Coroutine function returns a “Task” as its return value, and will co_await on 2 different awaitables. It returns its result using co_return.

Coroutine Handle

This is a super important element that enables the caller to access the Coroutine frame/API – for example, to resume a suspended Coroutine.

In the above code snippets, you’d find the handle in class Task – which is mapped to a Coroutine object. In the following code snippets, you’d also find this handle being passed as a parameter to the Awaiter::await_suspend(handle) function.

Coroutine Frame

This is internal, i.e. not really visible to us, the authors of the code. It is most likely heap-allocated state managed by the compiler.

It contains a promise object with all the application-related context, aka my application business logic.

Example – Coroutine with UDP server and io_uring

As mentioned, this part was heavily inspired by Pablo Arias’s blog post.

In this exercise we will demonstrate how we can combine Coroutines with file reading/writing as well as working with networking protocols.

In the meetup recording you can find the conan 1.x configuration, using crc_cpp and fmt.

Demonstrating Synchronous first

Before we jump into demonstrating how coroutines + io_uring make our lives easier, let’s show the synchronous part first.

In the code above, a udp_connection runs a blocking read() function. Once it is done, it populates a vector and computes a CRC16 to show some “work done” (the CRC code comes from a third-party library).

udp_connection::read() would look like this:
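A possible shape for such a class, as a sketch under assumptions: I use plain POSIX sockets bound to the loopback address, a 1500-byte receive buffer, and a hypothetical port() accessor. The loopback_roundtrip() helper is only there to exercise read() by sending the socket a datagram first; none of these details are taken from the talk's code.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical udp_connection: owns one UDP socket bound to 127.0.0.1:port.
class udp_connection {
public:
    explicit udp_connection(std::uint16_t port) : fd_(::socket(AF_INET, SOCK_DGRAM, 0)) {
        if (fd_ < 0) throw std::runtime_error("socket() failed");
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = htons(port);  // port 0 lets the kernel pick a free port
        if (::bind(fd_, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) < 0) {
            ::close(fd_);
            throw std::runtime_error("bind() failed");
        }
    }
    ~udp_connection() { ::close(fd_); }
    udp_connection(const udp_connection&) = delete;
    udp_connection& operator=(const udp_connection&) = delete;

    // Port the socket is actually bound to (useful when constructed with 0).
    std::uint16_t port() const {
        sockaddr_in addr{};
        socklen_t len = sizeof(addr);
        if (::getsockname(fd_, reinterpret_cast<sockaddr*>(&addr), &len) < 0)
            throw std::runtime_error("getsockname() failed");
        return ntohs(addr.sin_port);
    }

    // Blocks until one datagram arrives and returns its payload.
    std::vector<std::uint8_t> read() {
        std::vector<std::uint8_t> buf(1500);
        const ssize_t n = ::recv(fd_, buf.data(), buf.size(), 0);
        if (n < 0) throw std::runtime_error("recv() failed");
        buf.resize(static_cast<std::size_t>(n));
        return buf;
    }

private:
    int fd_;
};

// Hypothetical helper: sends `payload` to a fresh connection, then reads it back.
std::vector<std::uint8_t> loopback_roundtrip(const std::vector<std::uint8_t>& payload) {
    udp_connection conn{0};
    const int tx = ::socket(AF_INET, SOCK_DGRAM, 0);
    if (tx < 0) throw std::runtime_error("socket() failed");
    sockaddr_in dst{};
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    dst.sin_port = htons(conn.port());
    const ssize_t sent = ::sendto(tx, payload.data(), payload.size(), 0,
                                  reinterpret_cast<const sockaddr*>(&dst), sizeof(dst));
    ::close(tx);
    if (sent < 0) throw std::runtime_error("sendto() failed");
    return conn.read();   // the blocking call the rest of the post wants to avoid
}
```

The important property for the rest of the post is that read() blocks the calling thread until data arrives.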

Pretty simple and straightforward. Computing the CRC16 is just looping over the buffer and computing the value. It is omitted here for brevity, but the information can be found in the meetup slide deck (and on my github.com).
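To give an idea of what "looping over the buffer" means, here is a hand-rolled bitwise CRC16. This is a stand-in and not the crc_cpp API used in the talk, and the variant (CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF) is my assumption, since the post does not say which CRC16 flavor was used.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bitwise CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF, no reflection).
// Stand-in for the third-party library; slow but easy to follow.
std::uint16_t crc16_ccitt_false(const std::vector<std::uint8_t>& data) {
    std::uint16_t crc = 0xFFFF;
    for (const std::uint8_t byte : data) {
        crc ^= static_cast<std::uint16_t>(byte << 8);   // feed the byte into the top bits
        for (int bit = 0; bit < 8; ++bit) {
            // Shift out one bit; apply the polynomial when the top bit was set.
            crc = (crc & 0x8000)
                      ? static_cast<std::uint16_t>((crc << 1) ^ 0x1021)
                      : static_cast<std::uint16_t>(crc << 1);
        }
    }
    return crc;
}
```

For the usual check string "123456789" this variant gives 0x29B1. A table-driven library will be faster, but it computes the same thing.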

Struct Result is just a simple struct to aggregate and collect information that will be eventually printed to screen:

Here is the complete “main loop” for the Sync example:

I’m just showing 2 connections but you can easily extrapolate it to be >2.

Enter Coroutines and Linux io_uring

First, let’s discuss io_uring. The simplest way to think about it is as submitting a request to the kernel to do some work on behalf of user space, and then gathering the results back when the work is done and the results are ready:

Once we have this concept understood, we will tie it into Coroutines using the following flow:

The part where we Read UDP will be offloaded using Coroutines mechanism in conjunction with another Coroutine piece that will write results into files. For each context, the UDP read and File Write operations would be 2 different awaitables triggered by the same Coroutine frame.

The nice part about Coroutines is that the function “launching” these awaitables will be suspended and resumed while the entire function’s context stays “intact”. This feels very natural when reading the code. There are no messy state machines and no complex locking code. It’s all abstracted away by the Coroutine framework.

The following diagram depicts what the Coroutine frames look like. A Coroutine frame is mapped here to a function named readUdpWriteFileAsync().

There are a couple of arrows going out from the frame into functions. Think about these functions as awaitables.

The diagram shows 2 frames. The reason is that the main() function loops over 2 connections. Each connection invokes the Coroutine function, creating a whole separate Coroutine frame and context.

When we invoke read_udp_socket inside the readUdpWriteFileAsync() Coroutine function, the Coroutine is suspended, so main() can continue its iteration and move on to the next udp_connection. Meanwhile, the previous one is “taking its time” reading from a socket in a separate thread.

This is the truly exciting part of Coroutines. I have a loop calling a function; this function fires off an async operation that runs on some thread (details below) doing its thing, while I continue looping and handling the next iteration. When I need to go back to _a_ Coroutine frame, it will be there waiting for me with the entire stack context intact!

Let’s take a look at main()

I just create 2 UDP connection objects, and you can easily extend it to >2. There is no socket activity here, just initialized objects of the UDP class type.

The following code is important. It actually triggers the creation of the Coroutine frames, 2 in our example. Each time we invoke readUdpWriteFileAsync(), a Coroutine frame is created, returning a “Task” class instance. More on “Task” below. We just keep the tasks in a vector so we can use each of them later.

liburing is used here to abstract out some io_uring boilerplate code.

Pablo had a nice liburing wrapper that I’m going to use here. It’s a simple wrapper over io_uring handle, implementing RAII (init, destroy). This is going to be very handy when interacting with io_uring via liburing library.

Promise & Task

After introducing the concept and some of the non-coroutine framework, let’s delve into the actual Coroutine-framework details.

We will start with defining the “Task” class type:

There is an inner struct named promise_type. As mentioned above, the promise holds my business logic, and you can see a class data member named result.

A few important functions that I implemented (there is also the inheritance way, btw):

  • get_return_object – instantiates and returns a Task, simply by calling its constructor with this as the argument.
  • I don’t care about exceptions. It’s an empty method here.
  • return_value – will be “magically” invoked by compiler-generated code. My function definition is going to set the incoming parameter to my internal data member.
  • I don’t care about initial_suspend/final_suspend. I’m returning suspend_never, which means: do not suspend at the init/cleanup points.

Let’s zoom out and see what else we have inside class Task:

struct promise is collapsed as we’ve already discussed it.

Task has a constructor that takes promise* parameter. No surprises here since promise_type‘s get_return_object() returns Task(this).

The Coroutine handle is obtained from the promise. The Coroutine handle is our mechanism to control the Coroutine’s flow. If we hold a Coroutine handle, we can, for example, check if the Coroutine is done() processing, or we can resume() the Coroutine. We use this handle a few times in our application.

Task has one important data member. The Coroutine handle! This is our opaque handle to the Coroutine frame/context.

  • get_result() – will return the (final business logic) result, residing inside the promise_type.
  • done() – will return true if the Coroutine is done executing, essentially when we exit from the Coroutine function/frame. In our case, that means exiting readUdpWriteFileAsync().
  • Destructor destroys the handle.

The awaitables

We have 2 awaitables.

  1. For UDP read
  2. For writing data to files on disk.

Here is what it looks like:

The Coroutine function/frame is readUdpWriteFileAsync().

  • When we want to co_await on the UDP read part we will call co_await read_udp_socket. A thread will be created and used to fulfill and run this context. When we are done, we are going to resume the suspended frame. We use the Coroutine handle to resume the suspended Coroutine. read_udp_socket is our first awaitable.
  • When this specific frame instance is resumed, we interact with the io_uring framework, submitting a job for the kernel to execute. Here we don’t need to worry about any user-space threads, since this job is executed in the context of kernel space. We just need to collect the results when they are ready. When they are, we resume the suspended frame, running it to completion. The Task is done at that point.

Coroutine function readUdpWriteFileAsync

Here is the implementation of our Coroutine function with 2 Awaitables.

The first Awaitable is read_udp_socket(). Notice the special keyword “co_await”. More details to follow.

Next, we create a file and move to the next Awaitable: write_file_awaitable(). The new C++20 co_await keyword is also used here.

At the end of the frame, we return the result using the co_return keyword.

When running the application, a lot happens in compiler-generated code. For example, when we call the Coroutine function, Task::promise_type::get_return_object() is immediately invoked. A debugger is a handy tool here to view the stack trace. You’ll notice some strange jumps in the debugger, as if the Coroutine function readUdpWriteFileAsync() is “exiting”, but it’s not. It’s just the debugger showing that a Task object was created, and it will jump “all over the place”.

To assist us with collecting results AND holding the coroutine_handle, I will create a simple struct to aggregate both:

Awaitable read_udp_socket

Let’s zoom in on our first Awaitable. It’s a class type with operator co_await:

operator co_await() is a new C++20 special construct. We need to return an “Awaitable” type instance. What is Awaitable? It’s a type with a specific interface that we need to implement. Note the return value of operator co_await: it’s the Awaiter struct instance initialized with udp_. What interface do we have in Awaitable?

  • await_ready – we return false. We are not ready yet, and work needs to be done. We need the Task/frame to be suspended!
  • await_suspend – a customization point before we suspend the “outer Task/frame”. We keep the coroutine_handle by assigning it to some structure, as we will need it soon: we need the handle to tell the Task/frame “continue/resume”. The handle is our way to control the flow of the Task!

So the idea is simple. The outer Task calls co_await on the Awaitable. Operator co_await kicks in. We keep the coroutine_handle before we suspend the Task, and then execute the work on a dedicated thread. Remember, the outer Task is now suspended! The main loop will continue with the next UDP connection.

When we resume the Coroutine, await_resume() is invoked on the Awaiter instance, and its result becomes the value of the co_await expression. If you set a breakpoint at Awaiter::await_resume(), you will see how invoking resume() on the coroutine_handle calls some compiler-generated symbol, which in turn invokes Awaiter::await_resume(). It can all be viewed in the gdb stack trace. Here is a typical example:

You probably noticed the function readSync(). This function does a simple blocking UDP read + CRC16 check. When it’s done, it takes the coroutine_handle that was passed as an argument and calls resume().

Writing to disk using io_uring

This is the second Awaitable. Unlike the UDP one, this does not run in a std::thread context, but uses kernel threads instead. Here is a helper struct named write_ctx. We are going to pass it to the kernel as opaque data and receive it back later (think cookie).

The struct follows the same idea as the UDP one. It includes a std::coroutine_handle<void> and some result (status_code).

And here is the full write_file_awaitable Awaitable class type with its internal operator co_await().

The constructor of this Awaitable obtains an io_uring_sqe, which is a token corresponding to the job we ask the kernel to do on our behalf. We request to write a chunk of memory to a file using this token. This is the first part, where the instance of the Awaitable is being created.

Now the co_await operator kicks in. Same idea as with UDP. We need to implement a few interface functions.

  • await_ready() – no, we are not ready, please suspend the Task.
  • await_suspend() – customization point before the Task is suspended. We cache the handle and we tell the kernel: please associate the following piece of data with the token. Remember, without this coroutine_handle we cannot resume when we are done with our file write! When the kernel is done performing the requested job, writing to a file in our case, it gives us back the opaque data that we provided when calling io_uring_sqe_set_data(). Finally, we call io_uring_submit() to get this whole io_uring process going.
  • await_resume() – when the Task is resumed, the status_code is returned. The status_code in this example corresponds to the success/failure of writing into the file.

io_uring results – processing logic

It’s time to gather the results and process them.

Remember that we had a vector of “Tasks”?

These Tasks are actually mapped to our Coroutine functions.

When we invoked the Coroutine function, we got a “Task” back as the return value and pushed it into our std::vector<Task> with push_back.

The Task class type has a function named “done”. This function returns the Coroutine handle’s “done” state. If it’s true, we are done with the Coroutine: either all awaitables are done or an exception propagated. The important part is that the function is done and we can query the handle in order to figure that out.

The function all_done just iterates over the vector and asks whether each Task is done.

When we are all done, we just print something. This is a simple final “processing” and obviously real-life logic would do more here.

But inside the while loop, there is a special call to consumeCQEntriesBlocking(). Let’s take a look and discuss the last part of this flow.

When we submit a request to the kernel via the io_uring framework, at some point in time we would like to retrieve the result (or the error, if one was returned).

io_uring_submit_and_wait() is a liburing function that submits pending requests and blocks until the requested number of completions is available. Once at least 1 completed entry is waiting for us, we return from the function and process it.

This is what consumeCQEntries() function is doing. The coroutine magic part is the line that invokes resume() on the handle.

When we submitted our request, we also submitted a “cookie”: opaque data that we are now getting back! We static_cast<> it and now have access to the status_code as well as to the Coroutine handle! This is a key element for us to control the Coroutine.

We can now resume the Coroutine, and the function readUdpWriteFileAsync() will continue to run. In our case it will complete and exit using the co_return keyword. It will return the Result struct instance.

Handle’s done() is now going to return us “true”.

When we call .resume() and the Coroutine continues to run, it’d run on the “main” thread. You can think about it as the main thread is borrowed to execute the remaining code in the Coroutine function all the way to the co_return statement.

Summary

The above demonstrated a single Coroutine function returning a “Task” class type, with 2 Awaitables: “UDP socket receive” and “file write” using Linux io_uring. We showed how to use the Coroutine handle to obtain info on the Coroutine frame, as well as to control the actual flow of the program.

There is a big advantage of having the context of the frame kept intact, valid, in one location while all kinds of jobs are triggered, suspended and resumed without any need for synchronization. In traditional multi-threading/multitasking you’d need synchronization primitives in order to achieve the same.

Thank you for reading!

Kobi
