I often use queues for work and in personal projects, and I think you do too; for instance, you may need to save bytes received from a UART for later processing, or to read from a file faster than stdio. Any time you need two entities to communicate asynchronously in a FIFO pattern, you use a queue.
The most common implementation of a queue is the circular buffer, which uses a fixed-size buffer and two cursors over it, usually referred to as head and tail. They index the locations of the next read and the next write. If you were to google for a queue implementation or ask ChatGPT to implement one, you would probably get something like this one:
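(A minimal sketch for illustration; the queue_t structure and the queue_push/queue_pop names are placeholders rather than the exact code from the repo.)

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SIZE 1024          /* illustrative fixed capacity */

typedef struct {
    uint8_t buf[QUEUE_SIZE];
    size_t  head;                /* index of the next read  */
    size_t  tail;                /* index of the next write */
    size_t  nelem;               /* number of bytes stored  */
} queue_t;

/* Push one byte; returns false if the queue is full. */
static bool queue_push(queue_t *q, uint8_t byte)
{
    if (q->nelem == QUEUE_SIZE)
        return false;
    q->buf[q->tail] = byte;
    q->tail = (q->tail + 1) % QUEUE_SIZE;
    q->nelem++;
    return true;
}

/* Pop one byte; returns false if the queue is empty. */
static bool queue_pop(queue_t *q, uint8_t *byte)
{
    if (q->nelem == 0)
        return false;
    *byte = q->buf[q->head];
    q->head = (q->head + 1) % QUEUE_SIZE;
    q->nelem--;
    return true;
}
```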
For future reference, let's name the previous implementation the "Basic Byte Queue (BBQ)".
I used this style of queue for quite some time, but I always felt uneasy about it, having the feeling it could be improved quite a bit. Only recently did I stop and think about how. So, this article shows my process of improvement and how I got to an implementation that is 1000x faster, more elegant, and that provides more features. To put things in perspective, 1000x is the difference between processing 1 GB of data in 30 ms instead of 30 s!
All the code shown in this article, along with the final implementation and the tests for correctness and performance, is available in the github repo.
The starting point is the API: why call the push/pop functions on each byte? Wouldn't it be better if we managed bytes in batches? This would greatly reduce the cost associated with function calls, such as preparing the stack frame and jumping to the function. Apart from this overhead, there are also a number of operations that could be performed once for the whole batch of bytes to push/pop instead of once for every single byte.
This is what the queue_push and queue_pop operations look like with this modification:
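(Again a sketch, reusing the queue_t from the previous snippet; the real signatures are in the repo.)

```c
#include <string.h>

/* Push up to len bytes from data; returns how many bytes were actually copied. */
static size_t queue_push(queue_t *q, const uint8_t *data, size_t len)
{
    size_t n = QUEUE_SIZE - q->nelem;       /* free space */
    if (n > len)
        n = len;

    /* The destination region may wrap around the end of the buffer. */
    size_t first = QUEUE_SIZE - q->tail;
    if (first > n)
        first = n;
    memcpy(&q->buf[q->tail], data, first);
    memcpy(&q->buf[0], data + first, n - first);

    q->tail = (q->tail + n) % QUEUE_SIZE;
    q->nelem += n;
    return n;
}

/* Pop up to len bytes into data; returns how many bytes were actually copied. */
static size_t queue_pop(queue_t *q, uint8_t *data, size_t len)
{
    size_t n = q->nelem;                    /* bytes available */
    if (n > len)
        n = len;

    /* The source region may wrap around the end of the buffer. */
    size_t first = QUEUE_SIZE - q->head;
    if (first > n)
        first = n;
    memcpy(data, &q->buf[q->head], first);
    memcpy(data + first, &q->buf[0], n - first);

    q->head = (q->head + n) % QUEUE_SIZE;
    q->nelem -= n;
    return n;
}
```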
Let's call this implementation the "Vectorial Byte Queue (VBQ)".
With this API, the push operation expects a pointer to an array of bytes and its length, and it tries to copy all of them into the queue. The pop operation does something similar in the opposite direction.
The second modification comes from noticing that the most expensive operation on a computer, in terms of latency, is accessing memory. Roughly, accessing RAM takes 10x the time needed to access the L2 cache, and accessing the L2 takes 10x the time needed to access the L1.
The rule of thumb here is: always avoid unnecessary copies!
We are actually doing an unnecessary copy in the previous implementation, because the push operation could instead provide a direct pointer into the internal queue memory together with the maximum number of bytes the user can write there; the same applies to the pop operation.
This is what the two operations look like after applying this change:
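(A sketch of what such a zero-copy API could look like; the *_start/*_commit names are placeholders, the exact API is in the repo.)

```c
/* Where can the user write, and how many contiguous bytes at most?
 * Returns a pointer into the internal buffer; *avail receives the limit. */
static uint8_t *queue_push_start(queue_t *q, size_t *avail)
{
    size_t free   = QUEUE_SIZE - q->nelem;
    size_t contig = QUEUE_SIZE - q->tail;   /* bytes until the end of the buffer */
    *avail = free < contig ? free : contig;
    return &q->buf[q->tail];
}

/* Commit n bytes written through the pointer returned above. */
static void queue_push_commit(queue_t *q, size_t n)
{
    q->tail = (q->tail + n) % QUEUE_SIZE;
    q->nelem += n;
}

/* Symmetric operations for the pop side. */
static const uint8_t *queue_pop_start(queue_t *q, size_t *avail)
{
    size_t used   = q->nelem;
    size_t contig = QUEUE_SIZE - q->head;
    *avail = used < contig ? used : contig;
    return &q->buf[q->head];
}

static void queue_pop_commit(queue_t *q, size_t n)
{
    q->head = (q->head + n) % QUEUE_SIZE;
    q->nelem -= n;
}
```

A typical producer would call queue_push_start, write (or DMA) data directly into the returned pointer, and then call queue_push_commit with the number of bytes actually written.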
Let's call this implementation the "Acked Byte Queue (ABQ)".
With this, the user needs to commit the operations to inform the queue of how many bytes were actually pushed or popped, but you can clearly see that this code allows the program to skip one useless data copy.
Another step forward is reducing the internal state to just two of the three variables: head, tail, and nelem. The reason we have all three in the first place is that when head == tail, it's impossible to distinguish whether the queue is empty or full once the producer has filled it. Some implementations remove nelem and use a special value (e.g. -1) for head and tail to indicate emptiness; others remove tail and rely only on head and nelem.

But there's also a third, more elegant solution: using only head and tail, without any special marker, while still being able to clearly distinguish between an empty and a full queue.

If head and tail are stored as free-running counters, i.e. without ever applying the modulo to them, and the checks are performed on these raw values, the two scenarios can be disambiguated: head == tail is true if and only if the queue is empty, and the two counters differ by the queue size if and only if it is full. Only when the buffer actually needs to be accessed is the modulo applied to obtain the index.
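Concretely, with head and tail turned into free-running counters (a sketch assuming 64-bit counters, which will not realistically overflow), the checks become:

```c
/* head and tail are free-running 64-bit counters, never reduced modulo QUEUE_SIZE. */
uint64_t used = q->tail - q->head;          /* bytes currently stored   */
uint64_t free = QUEUE_SIZE - used;          /* bytes that can be pushed */

bool is_empty = (q->head == q->tail);       /* used == 0                */
bool is_full  = (used == QUEUE_SIZE);       /* no space left            */

/* The modulo is applied only when the buffer is actually accessed. */
size_t read_idx  = q->head % QUEUE_SIZE;    /* index of the next read   */
size_t write_idx = q->tail % QUEUE_SIZE;    /* index of the next write  */
```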
We are almost there but there's a feature we would love our queue to have: multithreaded usage.
By using head and tail, we gain the additional benefit that the queue can be made MT-safe with only minimal modification. If we only allow one producer thread and one consumer thread, we can even avoid locks, making the implementation entirely lock-free. This holds because, even if a data race occurs between the producer and the consumer, each thread only updates its own variable and any stale value still corresponds to a valid queue state, so no undefined behavior occurs. The only thing needed is that head and tail are updated with release consistency, to be sure that the buffer memory is updated before them.
If head and nelem are chosen to represent the state instead, the consumer needs to modify both atomically after a pop, or the state would be inconsistent. This leads to the need for a lock, or to storing both variables in a single word that can be accessed atomically, but that seems to me too much of a trick compared with the previous option.
This is what the operations look like after applying the two previous observations:
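(A sketch of both sides with C11 atomics; the function names follow the earlier snippets, and the exact code is in the repo. With a power-of-two QUEUE_SIZE the compiler turns the modulo into a cheap bit mask.)

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SIZE 1024          /* illustrative; a power of two keeps the modulo cheap */

typedef struct {
    uint8_t          buf[QUEUE_SIZE];
    _Atomic uint64_t head;       /* read cursor, advanced only by the consumer  */
    _Atomic uint64_t tail;       /* write cursor, advanced only by the producer */
} queue_t;

/* Producer side: contiguous writable region. Note the private copies of the
 * cursors: each one is read exactly once and reused for the whole computation. */
static uint8_t *queue_push_start(queue_t *q, size_t *avail)
{
    uint64_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);

    size_t free   = QUEUE_SIZE - (size_t)(tail - head);
    size_t contig = QUEUE_SIZE - (size_t)(tail % QUEUE_SIZE);

    *avail = free < contig ? free : contig;
    return &q->buf[tail % QUEUE_SIZE];
}

/* Producer side: the release store guarantees that the bytes written into the
 * buffer become visible before the new tail value does. */
static void queue_push_commit(queue_t *q, size_t n)
{
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    atomic_store_explicit(&q->tail, tail + n, memory_order_release);
}

/* Consumer side: symmetric, loading tail with acquire and storing head with release. */
static const uint8_t *queue_pop_start(queue_t *q, size_t *avail)
{
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed);

    size_t used   = (size_t)(tail - head);
    size_t contig = QUEUE_SIZE - (size_t)(head % QUEUE_SIZE);

    *avail = used < contig ? used : contig;
    return &q->buf[head % QUEUE_SIZE];
}

static void queue_pop_commit(queue_t *q, size_t n)
{
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    atomic_store_explicit(&q->head, head + n, memory_order_release);
}
```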
Let's call this implementation the "Lock-Free Queue (LFQ)".
A small but crucial detail in a multithreaded scenario is the use of a private copy of head in the push operation and of tail in the pop operation. If you don't read these variables into a local copy and use only that, you can end up using different values throughout the function computations, which may lead to incorrect results.
There are two other improvements that we can make:
Let's have a look at a piece of the final implementation available in the github repo:
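(A sketch of the kind of branch-free computation it contains, for the push side; bq_push_avail is an illustrative fragment using the head_n_wrap/tail_n_wrap naming discussed below, not the exact repo code.)

```c
/* Contiguous bytes writable starting at tail, computed without any if/else.
 * head and tail are the unwrapped cursors loaded as in the LFQ sketch. */
static size_t bq_push_avail(uint64_t head, uint64_t tail)
{
    uint64_t head_idx    = head % QUEUE_SIZE;
    uint64_t tail_idx    = tail % QUEUE_SIZE;
    uint64_t head_n_wrap = head / QUEUE_SIZE;   /* how many times head wrapped */
    uint64_t tail_n_wrap = tail / QUEUE_SIZE;   /* how many times tail wrapped */

    /* With branches this would be:
     *   if (tail_n_wrap == head_n_wrap) avail = QUEUE_SIZE - tail_idx;  // up to the end
     *   else                            avail = head_idx  - tail_idx;   // up to head
     * Since tail_n_wrap - head_n_wrap can only be 0 or 1, both cases collapse into: */
    return (size_t)((QUEUE_SIZE - tail_idx)
                    - (tail_n_wrap - head_n_wrap) * (QUEUE_SIZE - head_idx));
}
```

The pop side can be collapsed in the same way.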
Let's call this implementation the "Byte Queue (BQ)".
We managed to eliminate the if statements by observing that head_n_wrap and tail_n_wrap can only differ by 0 or 1. This allowed us to collapse the if/else branches into a single multiplication.
Now let's have some fun and compare the speed of these different implementations. Note: since they are tested in a two-thread scenario, we need to add a mutex to protect the BBQ, VBQ and ABQ operations so that they become MT-safe.
The following plot compares the time required to transfer 1 GB of data through the queues. The queue size is fixed at 1 MB, and the plot shows the clock cycles per byte as a function of the maximum number of bytes allowed per operation.
We can see that the major source of overhead reduction is from BBQ to VBQ, thanks to the elimination of all those function calls. This optimization alone yields a minimum reduction of 8x and a maximum of roughly 1600x in the number of cycles per byte, and the gain grows with how much data you can process per operation.
The second most impactful optimization is the one from VBQ to ABQ, thanks to the removal of the unnecessary copy. The gain from this optimization also grows with how much data you can process per operation: you get a 20-30% reduction with relatively few bytes per operation, but if you can move more than 64 KB per operation you can get more than a 2x reduction in the number of cycles per byte.
The performance gain from the last two optimizations (LFQ and BQ) decreases with how much data you can process per operation because, when moving a lot of bytes at once, the clock cycles needed to lock and unlock the mutex and to do divisions and modulo operations are very few compared to those needed to actually move the data. In other words, if you are able to move a lot of bytes per operation, the number of calls to the queue functions decreases, so optimizing them has less impact on performance. That said, when moving 64 or 512 bytes per operation you get roughly an 18% reduction in the number of cycles per byte: not bad!
If you have any questions or feedback regarding this article, you can find my contact information on the homepage.