Are you sure you can implement a queue?

I often use queues at work and in personal projects, and I suspect you do too; for instance, you may need to save bytes received from a UART for later processing, or to read from a file faster than stdio allows. Any time you need two entities to communicate asynchronously in a FIFO pattern, you use a queue.

The most common implementation of a queue is the circular buffer, which uses a fixed-size buffer and two cursors over it, usually referred to as head and tail. They index the locations of the next read and the next write, respectively. If you were to google for a queue implementation or ask ChatGPT to write one, you would probably get something like this:

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    uint8_t *data;
    size_t   head;
    size_t   tail;
    size_t   nelem;
    size_t   size;
} ByteQueue;

void queue_init(ByteQueue *q, uint8_t *buffer, size_t size)
{
    q->data  = buffer;
    q->size  = size;
    q->head  = 0;
    q->tail  = 0;
    q->nelem = 0;
}

bool queue_push(ByteQueue *q, uint8_t byte)
{
    bool ok = false;
    if (q->nelem < q->size) {
        q->data[q->tail] = byte;
        q->tail = (q->tail + 1) % q->size;
        q->nelem++;
        ok = true;
    }
    return ok;
}

bool queue_pop(ByteQueue *q, uint8_t *byte)
{
    bool ok = false;
    if (q->nelem > 0) {
        *byte = q->data[q->head];
        q->head = (q->head + 1) % q->size;
        q->nelem--;
        ok = true;
    }
    return ok;
}

For future reference, let's call the previous implementation the "Basic Byte Queue (BBQ)".
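
As a quick sanity check, here is a minimal usage sketch of the BBQ API (the buffer size and the byte values are arbitrary):

#include <stdio.h>

int main(void)
{
    uint8_t storage[64];                  /* arbitrary backing buffer */
    ByteQueue q;
    queue_init(&q, storage, sizeof(storage));

    for (uint8_t i = 0; i < 10; i++)
        queue_push(&q, i);                /* one function call per byte */

    uint8_t byte;
    while (queue_pop(&q, &byte))          /* one function call per byte */
        printf("%u\n", byte);

    return 0;
}

Note how every single byte goes through a function call: this is exactly the overhead the next section attacks.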

I used this style of queue for quite some time, but I always felt uneasy about it, suspecting it could be improved quite a bit. Only recently did I stop and think about how. This article shows that process of improvement and how I arrived at an implementation that is 1000x faster, more elegant, and richer in features. To put things in perspective, 1000x is the difference between processing 1 GB of data in 30 ms instead of 30 s!

All the code shown in this article, along with the final implementation and the tests for correctness and performance, is available in the GitHub repo.

Changing the API

The starting point is the API: why call the push/pop functions on each byte? Wouldn't it be better if we managed bytes in batches? This would greatly reduce the cost associated with function calls, such as preparing the stack frame and jumping to the function. Besides this overhead, there are also operations that can be performed once for the whole batch instead of once per byte, such as checking the available space and updating the cursors.

This is what the queue_push and queue_pop operations look like with this modification:

#include <string.h>

#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif

size_t queue_push_vector(ByteQueue *q, const uint8_t *bytes, size_t count)
{
    size_t tailcopy_size = 0; /* stays 0 when no copy towards the end of the buffer is possible */
    size_t total = 0;

    if (q->head < q->tail || q->nelem == 0) {
        tailcopy_size = MIN(q->size - q->tail, count);
        memcpy(q->data + q->tail, bytes, tailcopy_size);
        q->tail = (q->tail + tailcopy_size) % q->size;
        total += tailcopy_size;
    }

    size_t headcopy_size = MIN(q->head - q->tail, count - tailcopy_size);
    if (headcopy_size > 0) {
        memcpy(q->data + q->tail, bytes + tailcopy_size, headcopy_size);
        q->tail += headcopy_size;
        total += headcopy_size;
    }

    q->nelem += total;
    return total;
}

size_t queue_pop_vector(ByteQueue *q, uint8_t *bytes, size_t count)
{
    if (q->nelem == 0)
        return 0;

    if (q->head < q->tail) {
        size_t to_pop = MIN(q->tail - q->head, count);
        memcpy(bytes, q->data + q->head, to_pop);
        q->head = (q->head + to_pop) % q->size;
        q->nelem -= to_pop;
        return to_pop;
    } else { // q->head >= q->tail
        size_t frst_pop = MIN(q->size - q->head, count);
        memcpy(bytes, q->data + q->head, frst_pop);
        q->head = (q->head + frst_pop) % q->size;
        q->nelem -= frst_pop;

        size_t scnd_pop = MIN(q->tail - q->head, count - frst_pop);
        if (scnd_pop > 0) {
            memcpy(bytes + frst_pop, q->data + q->head, scnd_pop);
            q->head = (q->head + scnd_pop) % q->size;
            q->nelem -= scnd_pop;
        }
        return frst_pop + scnd_pop;
    }
}

Let's call this implementation the "Vectorial Byte Queue (VBQ)".

With this API, the push operation takes a pointer to an array of bytes and its length, tries to copy as many of them as possible into the queue, and returns the number actually copied. The pop operation does the same in the opposite direction.
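
As a quick illustration (a sketch: q is a ByteQueue initialized as before, and the chunk sizes are arbitrary), a producer can now hand over a whole chunk in one call and check how much of it was actually accepted:

uint8_t rx_chunk[256];
/* ... fill rx_chunk with 256 freshly received bytes ... */
size_t pushed = queue_push_vector(&q, rx_chunk, sizeof(rx_chunk));
if (pushed < sizeof(rx_chunk)) {
    /* the queue was (or became) full: the remaining
     * sizeof(rx_chunk) - pushed bytes must be retried or dropped */
}

uint8_t out[128];
size_t popped = queue_pop_vector(&q, out, sizeof(out));
/* out[0 .. popped-1] now holds up to 128 bytes in FIFO order */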

Removing unnecessary copies

The second modification comes from noticing that, in terms of latency, one of the most expensive operations a program routinely performs is accessing memory. Roughly, accessing RAM takes 10x the time needed to access the L2 cache, and accessing the L2 takes 10x the time needed to access the L1.

The rule of thumb here is: always avoid unnecessary copies!

The previous implementation actually performs an unnecessary copy: the push operation could instead hand the user a direct pointer into the queue's internal memory, together with the maximum number of bytes that can be written there. The same applies to the pop operation.

This is what the two operations look like after applying this change:

void *queue_get_push_buf(ByteQueue *q, size_t *len)
{
    if (q->head < q->tail || q->nelem == 0)
        *len = q->size - q->tail;
    else // q->head >= q->tail
        *len = q->head - q->tail;
    return q->data + q->tail;
}

void queue_commit_push(ByteQueue *q, size_t len)
{
    q->tail = (q->tail + len) % q->size;
    q->nelem += len;
}

void *queue_get_pop_buf(ByteQueue *q, size_t *len)
{
    if (q->head < q->tail || q->nelem == 0)
        *len = q->tail - q->head;
    else // q->head >= q->tail
        *len = q->size - q->head;
    return q->data + q->head;
}

void queue_commit_pop(ByteQueue *q, size_t len)
{
    q->head = (q->head + len) % q->size;
    q->nelem -= len;
}

Let's call this implementation the "Acked Byte Queue (ABQ)".

With this, the user needs to commit each operation to inform the queue of how many bytes were actually pushed or popped, but you can clearly see that this code allows the program to skip one useless data copy.
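
For example (a sketch, assuming a POSIX read() as the data source and an already-open file descriptor fd), the producer can let the kernel write straight into the queue's storage, with no intermediate buffer:

#include <unistd.h>

size_t avail;
void *dst = queue_get_push_buf(&q, &avail);  /* where and how much we may write */
ssize_t n = read(fd, dst, avail);            /* data lands directly in the queue */
if (n > 0)
    queue_commit_push(&q, (size_t)n);        /* tell the queue how much was written */

size_t ready;
void *src = queue_get_pop_buf(&q, &ready);   /* contiguous readable region */
/* ... process up to `ready` bytes starting at src ... */
queue_commit_pop(&q, ready);                 /* mark them as consumed */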

Making it MT-safe and lock-free

Another step forward is reducing the internal state to just two of the three variables: head, tail, and nelem. The reason we carry all three in the first place is that, with only the wrapped head and tail, the condition head == tail is ambiguous: it occurs both when the queue is empty and when the producer has completely filled it. Some implementations remove nelem and use a special value (e.g. -1) for head and tail to indicate emptiness; others remove tail and rely only on head and nelem. But there is also a third, more elegant solution: using only head and tail, without any special marker, while still being able to clearly distinguish between an empty and a full queue.

If head and tail are stored as free-running counters, i.e. the modulo is not applied when they are updated, and the necessary checks are performed on these raw values, the two scenarios can be disambiguated: head == tail if and only if the queue is empty, and tail - head == size if and only if it is full. The modulo is then applied only when the buffer actually needs to be indexed.
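
In other words (a small sketch of the idea, assuming nelem has already been dropped from the struct), the checks on the free-running counters become:

/* head and tail grow without bound; they are reduced modulo the
 * capacity only when the buffer is actually indexed. */
bool   queue_is_empty(const ByteQueue *q) { return q->tail == q->head; }
bool   queue_is_full (const ByteQueue *q) { return q->tail - q->head == q->size; }
size_t queue_count   (const ByteQueue *q) { return q->tail - q->head; }
/* index of the next byte to read: q->head % q->size */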

We are almost there, but there's one more feature we would love our queue to have: safe multithreaded usage.

By using head and tail, we gain the additional benefit that the queue can be made MT-safe with only minimal modification. If we only allow one producer thread and one consumer thread, we can even avoid locks, making the implementation entirely lock-free. This holds because, even if a data race occurs between the producer and consumer, each thread only updates its own variable and any stale value still corresponds to a valid queue state, so no undefined behavior occurs.

The only requirement is that head and tail are updated with release semantics, to ensure the buffer memory is updated before the new cursor value becomes visible. If head and nelem were chosen to represent the state instead, the consumer would need to modify both atomically after a pop, or the state would be momentarily inconsistent. That would require either a lock or packing both variables into a single word accessible atomically, which seems to me too much of a trick compared with the previous option.

This is what the operations look like after applying the two previous observations:

void *queue_get_push_buf(ByteQueue *q, size_t *len)
{
    // This private copy of head is essential to have a coherent
    // value throughout the function, regardless of the consumer's
    // actions.
    size_t head = q->head;
    size_t head_n_wrap = head / q->size;
    size_t tail_n_wrap = q->tail / q->size;
    if (head_n_wrap == tail_n_wrap)
        // head and tail are in the same block of q->size bytes
        *len = q->size - (q->tail % q->size);
    else
        *len = q->size - (q->tail - head);
    return q->data + (q->tail % q->size);
}

void queue_commit_push(ByteQueue *q, size_t len)
{
    __atomic_store_n(&q->tail, q->tail + len, __ATOMIC_RELEASE);
}

void *queue_get_pop_buf(ByteQueue *q, size_t *len)
{
    // This private copy of tail is essential to have a coherent
    // value throughout the function, regardless of the producer's
    // actions.
    size_t tail = q->tail;
    size_t head_n_wrap = q->head / q->size;
    size_t tail_n_wrap = tail / q->size;
    if (head_n_wrap == tail_n_wrap)
        // head and tail are in the same block of q->size bytes
        *len = tail - q->head;
    else
        // tail has wrapped into the next block: only the bytes up to
        // the end of the buffer are contiguous
        *len = tail - q->head - (tail % q->size);
    return q->data + (q->head % q->size);
}

void queue_commit_pop(ByteQueue *q, size_t len)
{
    __atomic_store_n(&q->head, q->head + len, __ATOMIC_RELEASE);
}

Let's call this implementation the "Lock-Free Queue (LFQ)".

A small but crucial detail in a multithreaded scenario is the use of a private copy of head in the push operation and of tail in the pop operation. If you don't read the other thread's variable into a local copy once and use only that copy, the other thread may update it between reads, so different parts of the computation may see different values and produce incorrect results.
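
To make the intended usage concrete, here is a sketch of how a single producer and a single consumer could drive the LFQ concurrently, each one writing only its own cursor (produce_bytes and consume_bytes are hypothetical placeholders for the actual data source and sink):

size_t produce_bytes(uint8_t *dst, size_t max);      /* hypothetical data source */
size_t consume_bytes(const uint8_t *src, size_t n);  /* hypothetical data sink   */

/* Producer thread: the only writer of q->tail (via queue_commit_push). */
void producer_loop(ByteQueue *q)
{
    for (;;) {
        size_t avail;
        uint8_t *dst = queue_get_push_buf(q, &avail);
        size_t n = produce_bytes(dst, avail);
        if (n > 0)
            queue_commit_push(q, n);
    }
}

/* Consumer thread: the only writer of q->head (via queue_commit_pop). */
void consumer_loop(ByteQueue *q)
{
    for (;;) {
        size_t ready;
        uint8_t *src = queue_get_pop_buf(q, &ready);
        size_t n = consume_bytes(src, ready);
        if (n > 0)
            queue_commit_pop(q, n);
    }
}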

Removing divisions and branches

There are two other improvements we can make:

  1. Replace divisions and modulo operations with bitwise shifts and ANDs, which is possible if we require the queue's capacity to be a power of 2 (see the short sketch after this list).
  2. Remove the if statements to make the implementation completely branchless.
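
The first point relies on two identities that hold only when the capacity is a power of 2; here is a short sketch (the names are illustrative, matching the fields used in the final code below):

size_t cap_lg2 = 20;                    /* e.g. a 1 MB queue                */
size_t cap     = (size_t)1 << cap_lg2;  /* capacity, a power of 2           */
size_t mask    = cap - 1;

size_t cursor  = 1234567;               /* any free-running head/tail value */
size_t index   = cursor & mask;         /* same result as cursor % cap      */
size_t n_wraps = cursor >> cap_lg2;     /* same result as cursor / cap      */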

Let's have a look at a piece of the final implementation available in the GitHub repo:

static void *bq_popbuf(bq *q, size_t *len)
{
    size_t tail = q->tail;
    // 0 if head and tail are in the same capacity-sized block, 1 otherwise
    size_t cond = (tail >> q->cap_lg2) - (q->head >> q->cap_lg2);
    *len = tail - q->head - (tail & q->mask) * cond;
    return q->data + (q->head & q->mask);
}

static void bq_pop(bq *q, size_t count)
{
    __atomic_store_n(&q->head, q->head + count, __ATOMIC_RELEASE);
}

static void *bq_pushbuf(bq *q, size_t *len)
{
    size_t head = q->head;
    // 0 if head and tail are in the same capacity-sized block, 1 otherwise
    size_t cond = (q->tail >> q->cap_lg2) - (head >> q->cap_lg2);
    *len = q->mask + 1 - (q->tail - head) - (q->head & q->mask) * (1 - cond);
    return q->data + (q->tail & q->mask);
}

static void bq_push(bq *q, size_t count)
{
    __atomic_store_n(&q->tail, q->tail + count, __ATOMIC_RELEASE);
}

Let's call this implementation the "Byte Queue (BQ)".

We managed to eliminate the if statements by observing that head_n_wrap and tail_n_wrap can only differ by 0 or 1. This allowed us to collapse the if/else branches into a single multiplication.
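
This is an instance of a general branchless-selection idiom. As a small sketch (a, b and cond are hypothetical names), when cond is known to be exactly 0 or 1:

/* Equivalent to: x = cond ? a : b;
 * With unsigned arithmetic the subtraction may wrap around,
 * but adding b back makes the result exact anyway. */
size_t x = b + (a - b) * cond;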

Performance comparison

Now let's have some fun and compare the speed of these different implementations. Note: since they are tested in a two-thread scenario, we need to add a mutex to protect the BBQ, VBQ and ABQ operations so that they are MT-safe.

The following plot compares the time required to transfer 1 GB of data through the queues. The queue size is fixed at 1 MB, and the plot shows the clock cycles per byte as a function of the maximum number of bytes allowed per operation.

Processing 1 GB of data through queues of 1 MB length

We can see that the biggest reduction in overhead comes from the step from BBQ to VBQ, thanks to the elimination of all those per-byte function calls. This optimization alone reduces the number of cycles per byte by at least 8x and by roughly 1600x at best, and its benefit grows with how much data you can process per operation.

The second most impactful optimization is the step from VBQ to ABQ, which removes the unnecessary copy. Its benefit also grows with how much data you can process per operation: you get a 20-30% reduction with relatively few bytes per operation, but if you can move more than 64 KB per operation the number of cycles per byte drops by more than 2x.

The performance gain from the last two optimizations (LFQ and BQ) decreases with how much data you can move per operation because, when moving a lot of bytes at once, the clock cycles spent locking and unlocking the mutex and doing divisions and modulo operations are very few compared with those needed to actually move the data. In other words, if you are able to move a lot of bytes per operation, the number of calls to the queue functions decreases, so optimizing them has less impact on performance. That said, when moving 64 or 512 bytes per operation you still get a roughly 18% reduction in the number of cycles per byte: not bad!

Any feedback?

If you have any questions or feedback regarding this article, you can find my contact information on the homepage.