Are you sure you can implement a queue?

I often use queues at work and in personal projects, and I suspect you do too; for instance, you may need to save bytes received from a UART for later processing, or to read from a file faster than stdio allows. Any time you need two entities to communicate asynchronously in a FIFO pattern, you use a queue.

The most common implementation of a queue is the circular buffer, which uses a fixed-size buffer and two cursors over it, usually referred to as head and tail. They index the locations of the next read and the next write, respectively. If you were to google for a queue implementation or ask ChatGPT to write one, you would probably get something like this:

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    uint8_t *data;
    size_t   head;
    size_t   tail;
    size_t   nelem;
    size_t   size;
} ByteQueue;

void queue_init(ByteQueue *q, uint8_t *buffer, size_t size)
{
    q->data  = buffer;
    q->size  = size;
    q->head  = 0;
    q->tail  = 0;
    q->nelem = 0;
}

bool queue_push(ByteQueue *q, uint8_t byte)
{
    bool ok = false;
    if (q->nelem < q->size) {
        q->data[q->tail] = byte;
        q->tail = (q->tail + 1) % q->size;
        q->nelem++;
        ok = true;
    }
    return ok;
}

bool queue_pop(ByteQueue *q, uint8_t *byte)
{
    bool ok = false;
    if (q->nelem > 0) {
        *byte = q->data[q->head];
        q->head = (q->head + 1) % q->size;
        q->nelem--;
        ok = true;
    }
    return ok;
}

For future reference, let's call the previous implementation the "Basic Byte Queue (BBQ)".
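
As a quick sanity check, here is a minimal usage sketch of the BBQ API (the buffer size and the byte values are arbitrary):

#include <stdio.h>

int main(void)
{
    uint8_t storage[64];                  /* arbitrary backing buffer */
    ByteQueue q;
    queue_init(&q, storage, sizeof(storage));

    for (uint8_t i = 0; i < 10; i++)
        queue_push(&q, i);                /* one function call per byte */

    uint8_t byte;
    while (queue_pop(&q, &byte))          /* one function call per byte */
        printf("%u\n", byte);

    return 0;
}

Note how every single byte goes through a function call: this is exactly the overhead the next section attacks.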

I used this style of queue for quite some time, but I always felt uneasy about it, suspecting it could be improved quite a bit. Only recently did I stop and think about how. This article shows that process of improvement and how I arrived at an implementation that is 1000x faster, more elegant, and richer in features. To put things in perspective, 1000x is the difference between processing 1 GB of data in 30 ms instead of 30 s!

All the code shown in this article, along with the final implementation and the tests for correctness and performance, is available in the GitHub repo.

Changing the API

The starting point is the API: why call the push/pop functions on each byte? Wouldn't it be better if we managed bytes in batches? This would greatly reduce the cost associated with function calls, such as preparing the stack frame and jumping to the function. Besides this overhead, there are also operations that can be performed once for the whole batch instead of once per byte, such as checking the available space and updating the cursors.

This is what the queue_push and queue_pop operations look like with this modification:

#include <string.h>

#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif

size_t queue_push_vector(ByteQueue *q, const uint8_t *bytes, size_t count)
{
    size_t tailcopy_size = 0; /* stays 0 when no copy towards the end of the buffer is possible */
    size_t total = 0;

    if (q->head < q->tail || q->nelem == 0) {
        tailcopy_size = MIN(q->size - q->tail, count);
        memcpy(q->data + q->tail, bytes, tailcopy_size);
        q->tail = (q->tail + tailcopy_size) % q->size;
        total += tailcopy_size;
    }

    size_t headcopy_size = MIN(q->head - q->tail, count - tailcopy_size);
    if (headcopy_size > 0) {
        memcpy(q->data + q->tail, bytes + tailcopy_size, headcopy_size);
        q->tail += headcopy_size;
        total += headcopy_size;
    }

    q->nelem += total;
    return total;
}

size_t queue_pop_vector(ByteQueue *q, uint8_t *bytes, size_t count)
{
    if (q->nelem == 0)
        return 0;

    if (q->head < q->tail) {
        size_t to_pop = MIN(q->tail - q->head, count);
        memcpy(bytes, q->data + q->head, to_pop);
        q->head = (q->head + to_pop) % q->size;
        q->nelem -= to_pop;
        return to_pop;
    } else { // q->head >= q->tail
        size_t frst_pop = MIN(q->size - q->head, count);
        memcpy(bytes, q->data + q->head, frst_pop);
        q->head = (q->head + frst_pop) % q->size;
        q->nelem -= frst_pop;

        size_t scnd_pop = MIN(q->tail - q->head, count - frst_pop);
        if (scnd_pop > 0) {
            memcpy(bytes + frst_pop, q->data + q->head, scnd_pop);
            q->head = (q->head + scnd_pop) % q->size;
            q->nelem -= scnd_pop;
        }
        return frst_pop + scnd_pop;
    }
}

Let's call this implementation the "Vectorial Byte Queue (VBQ)".

With this API, the push operation takes a pointer to an array of bytes and its length, tries to copy as many of them as possible into the queue, and returns the number actually copied. The pop operation does the same in the opposite direction.
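
As a quick illustration (a sketch: q is a ByteQueue initialized as before, and the chunk sizes are arbitrary), a producer can now hand over a whole chunk in one call and check how much of it was actually accepted:

uint8_t rx_chunk[256];
/* ... fill rx_chunk with 256 freshly received bytes ... */
size_t pushed = queue_push_vector(&q, rx_chunk, sizeof(rx_chunk));
if (pushed < sizeof(rx_chunk)) {
    /* the queue was (or became) full: the remaining
     * sizeof(rx_chunk) - pushed bytes must be retried or dropped */
}

uint8_t out[128];
size_t popped = queue_pop_vector(&q, out, sizeof(out));
/* out[0 .. popped-1] now holds up to 128 bytes in FIFO order */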

Removing unnecessary copies

The second modification comes from noticing that, in terms of latency, one of the most expensive operations a program routinely performs is accessing memory. Roughly, accessing RAM takes 10x the time needed to access the L2 cache, and accessing the L2 takes 10x the time needed to access the L1.

The rule of thumb here is: always avoid unnecessary copies!

The previous implementation actually performs an unnecessary copy: the push operation could instead hand the user a direct pointer into the queue's internal memory, together with the maximum number of bytes that can be written there. The same applies to the pop operation.

This is what the two operations look like after applying this change:

void *queue_get_push_buf(ByteQueue *q, size_t *len)
{
    if (q->head < q->tail || q->nelem == 0)
        *len = q->size - q->tail;
    else // q->head >= q->tail
        *len = q->head - q->tail;
    return q->data + q->tail;
}

void queue_commit_push(ByteQueue *q, size_t len)
{
    q->tail = (q->tail + len) % q->size;
    q->nelem += len;
}

void *queue_get_pop_buf(ByteQueue *q, size_t *len)
{
    if (q->head < q->tail || q->nelem == 0)
        *len = q->tail - q->head;
    else // q->head >= q->tail
        *len = q->size - q->head;
    return q->data + q->head;
}

void queue_commit_pop(ByteQueue *q, size_t len)
{
    q->head = (q->head + len) % q->size;
    q->nelem -= len;
}

Let's call this implementation the "Acked Byte Queue (ABQ)".

With this, the user needs to commit each operation to inform the queue of how many bytes were actually pushed or popped, but you can clearly see that this code allows the program to skip one useless data copy.
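
For example (a sketch, assuming a POSIX read() as the data source and an already-open file descriptor fd), the producer can let the kernel write straight into the queue's storage, with no intermediate buffer:

#include <unistd.h>

size_t avail;
void *dst = queue_get_push_buf(&q, &avail);  /* where and how much we may write */
ssize_t n = read(fd, dst, avail);            /* data lands directly in the queue */
if (n > 0)
    queue_commit_push(&q, (size_t)n);        /* tell the queue how much was written */

size_t ready;
void *src = queue_get_pop_buf(&q, &ready);   /* contiguous readable region */
/* ... process up to `ready` bytes starting at src ... */
queue_commit_pop(&q, ready);                 /* mark them as consumed */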

Making it MT-safe and lock-free

Another step forward is reducing the internal state to just two of the three variables: head, tail, and nelem. The reason we carry all three in the first place is that, with only the wrapped head and tail, the condition head == tail is ambiguous: it occurs both when the queue is empty and when the producer has completely filled it. Some implementations remove nelem and use a special value (e.g. -1) for head and tail to indicate emptiness; others remove tail and rely only on head and nelem. But there is also a third, more elegant solution: using only head and tail, without any special marker, while still being able to clearly distinguish between an empty and a full queue.

If head and tail are stored as free-running counters, i.e. the modulo is not applied when they are updated, and the necessary checks are performed on these raw values, the two scenarios can be disambiguated: head == tail if and only if the queue is empty, and tail - head == size if and only if it is full. The modulo is then applied only when the buffer actually needs to be indexed.
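
In other words (a small sketch of the idea, assuming nelem has already been dropped from the struct), the checks on the free-running counters become:

/* head and tail grow without bound; they are reduced modulo the
 * capacity only when the buffer is actually indexed. */
bool   queue_is_empty(const ByteQueue *q) { return q->tail == q->head; }
bool   queue_is_full (const ByteQueue *q) { return q->tail - q->head == q->size; }
size_t queue_count   (const ByteQueue *q) { return q->tail - q->head; }
/* index of the next byte to read: q->head % q->size */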

We are almost there, but there's one more feature we would love our queue to have: safe multithreaded usage.

By using head and tail, we gain the additional benefit that the queue can be made MT-safe with only minimal modification. If we only allow one producer thread and one consumer thread, we can even avoid locks, making the implementation entirely lock-free. This holds because, even if a data race occurs between the producer and consumer, each thread only updates its own variable and any stale value still corresponds to a valid queue state, so no undefined behavior occurs.

The only requirement is that head and tail are updated with release semantics, to ensure the buffer memory is updated before the new cursor value becomes visible. If head and nelem were chosen to represent the state instead, the consumer would need to modify both atomically after a pop, or the state would be momentarily inconsistent. That would require either a lock or packing both variables into a single word accessible atomically, which seems to me too much of a trick compared with the previous option.

This is what the operations look like after applying the two previous observations:

void *queue_get_push_buf(ByteQueue *q, size_t *len)
{
    // This private copy of head is essential to have a coherent
    // value throughout the function, regardless of the consumer's
    // actions.
    size_t head = q->head;
    size_t head_n_wrap = head / q->size;
    size_t tail_n_wrap = q->tail / q->size;
    if (head_n_wrap == tail_n_wrap)
        // head and tail are in the same block of q->size bytes
        *len = q->size - (q->tail % q->size);
    else
        *len = q->size - (q->tail - head);
    return q->data + (q->tail % q->size);
}

void queue_commit_push(ByteQueue *q, size_t len)
{
    __atomic_store_n(&q->tail, q->tail + len, __ATOMIC_RELEASE);
}

void *queue_get_pop_buf(ByteQueue *q, size_t *len)
{
    // This private copy of tail is essential to have a coherent
    // value throughout the function, regardless of the producer's
    // actions.
    size_t tail = q->tail;
    size_t head_n_wrap = q->head / q->size;
    size_t tail_n_wrap = tail / q->size;
    if (head_n_wrap == tail_n_wrap)
        // head and tail are in the same block of q->size bytes
        *len = tail - q->head;
    else
        // tail has wrapped into the next block: only the bytes up to
        // the end of the buffer are contiguous
        *len = tail - q->head - (tail % q->size);
    return q->data + (q->head % q->size);
}

void queue_commit_pop(ByteQueue *q, size_t len)
{
    __atomic_store_n(&q->head, q->head + len, __ATOMIC_RELEASE);
}

Let's call this implementation the "Lock-Free Queue (LFQ)".

A small but crucial detail in a multithreaded scenario is the use of a private copy of head in the push operation and of tail in the pop operation. If you don't read the other thread's variable into a local copy once and use only that copy, the other thread may update it between reads, so different parts of the computation may see different values and produce incorrect results.
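
To make the intended usage concrete, here is a sketch of how a single producer and a single consumer could drive the LFQ concurrently, each one writing only its own cursor (produce_bytes and consume_bytes are hypothetical placeholders for the actual data source and sink):

size_t produce_bytes(uint8_t *dst, size_t max);      /* hypothetical data source */
size_t consume_bytes(const uint8_t *src, size_t n);  /* hypothetical data sink   */

/* Producer thread: the only writer of q->tail (via queue_commit_push). */
void producer_loop(ByteQueue *q)
{
    for (;;) {
        size_t avail;
        uint8_t *dst = queue_get_push_buf(q, &avail);
        size_t n = produce_bytes(dst, avail);
        if (n > 0)
            queue_commit_push(q, n);
    }
}

/* Consumer thread: the only writer of q->head (via queue_commit_pop). */
void consumer_loop(ByteQueue *q)
{
    for (;;) {
        size_t ready;
        uint8_t *src = queue_get_pop_buf(q, &ready);
        size_t n = consume_bytes(src, ready);
        if (n > 0)
            queue_commit_pop(q, n);
    }
}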

Removing divisions and branches

There are two other improvements we can make:

  1. Replace divisions and modulo operations with bitwise shifts and ANDs, which is possible if we require the queue's capacity to be a power of 2 (see the short sketch after this list).
  2. Remove the if statements to make the implementation completely branchless.
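
The first point relies on two identities that hold only when the capacity is a power of 2; here is a short sketch (the names are illustrative, matching the fields used in the final code below):

size_t cap_lg2 = 20;                    /* e.g. a 1 MB queue                */
size_t cap     = (size_t)1 << cap_lg2;  /* capacity, a power of 2           */
size_t mask    = cap - 1;

size_t cursor  = 1234567;               /* any free-running head/tail value */
size_t index   = cursor & mask;         /* same result as cursor % cap      */
size_t n_wraps = cursor >> cap_lg2;     /* same result as cursor / cap      */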

Let's have a look at a piece of the final implementation available in the GitHub repo:

static void *bq_popbuf(bq *q, size_t *len)
{
    size_t tail = q->tail;
    // 0 if head and tail are in the same capacity-sized block, 1 otherwise
    size_t cond = (tail >> q->cap_lg2) - (q->head >> q->cap_lg2);
    *len = tail - q->head - (tail & q->mask) * cond;
    return q->data + (q->head & q->mask);
}

static void bq_pop(bq *q, size_t count)
{
    __atomic_store_n(&q->head, q->head + count, __ATOMIC_RELEASE);
}

static void *bq_pushbuf(bq *q, size_t *len)
{
    size_t head = q->head;
    // 0 if head and tail are in the same capacity-sized block, 1 otherwise
    size_t cond = (q->tail >> q->cap_lg2) - (head >> q->cap_lg2);
    *len = q->mask + 1 - (q->tail - head) - (q->head & q->mask) * (1 - cond);
    return q->data + (q->tail & q->mask);
}

static void bq_push(bq *q, size_t count)
{
    __atomic_store_n(&q->tail, q->tail + count, __ATOMIC_RELEASE);
}

Let's call this implementation the "Byte Queue (BQ)".

We managed to eliminate the if statements by observing that head_n_wrap and tail_n_wrap can only differ by 0 or 1. This allowed us to collapse the if/else branches into a single multiplication.
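
This is an instance of a general branchless-selection idiom. As a small sketch (a, b and cond are hypothetical names), when cond is known to be exactly 0 or 1:

/* Equivalent to: x = cond ? a : b;
 * With unsigned arithmetic the subtraction may wrap around,
 * but adding b back makes the result exact anyway. */
size_t x = b + (a - b) * cond;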

Performance comparison

Now let's have some fun and compare the speed of these different implementations. Note: since they are tested in a two-thread scenario, we need to add a mutex to protect the BBQ, VBQ and ABQ operations so that they are MT-safe.

The following plot compares the time required to transfer 1 GB of data through the queues. The queue size is fixed at 1 MB, and the plot shows the clock cycles per byte as a function of the maximum number of bytes allowed per operation.

Processing 1 GB of data through queues of 1 MB length

We can see that the biggest reduction in overhead comes from the step from BBQ to VBQ, thanks to the elimination of all those per-byte function calls. This optimization alone reduces the number of cycles per byte by at least 8x and by roughly 1600x at best, and its benefit grows with how much data you can process per operation.

The second most impactful optimization is the step from VBQ to ABQ, which removes the unnecessary copy. Its benefit also grows with how much data you can process per operation: you get a 20-30% reduction with relatively few bytes per operation, but if you can move more than 64 KB per operation the number of cycles per byte drops by more than 2x.

The performance gain from the last two optimizations (LFQ and BQ) decreases with how much data you can move per operation because, when moving a lot of bytes at once, the clock cycles spent locking and unlocking the mutex and doing divisions and modulo operations are very few compared with those needed to actually move the data. In other words, if you are able to move a lot of bytes per operation, the number of calls to the queue functions decreases, so optimizing them has less impact on performance. That said, when moving 64 or 512 bytes per operation you still get a roughly 18% reduction in the number of cycles per byte: not bad!

Any feedback?

If you have any questions or feedback regarding this article, you can find my contact information on the homepage.