Comparative Results and Takeaways
TL;DR: The Ringbuffer implementation saw a 38x decrease in average execution time when compared to the original channel implementation.
We ran a series of benchmarks to characterize the performance of the optimized ring buffer approach (RingbufferV6) and the optimized channel implementation (SimpleFanV3). First, we held the number of subscribers constant and increased the number of trades. We then held the number of trades constant and increased the number of subscribers. The goal was to understand how the two approaches perform under different constraints. As trade counts increase, we would expect higher allocations and more GC thrash in the channel implementation, whereas more subscribers would introduce more scheduling and communication overhead on the system.
The benchmarks revealed a striking difference between the two implementations. At the peak trade volume (10 million), the Ringbuffer saw a 10x decrease in allocations (allocs/op) and a 24x decrease in average execution time (ns/op) compared to the channel implementation. At the max subscriber benchmark (10 thousand), the Ringbuffer similarly outperformed the channel implementation with a 398x decrease in average execution time and a 10x decrease in allocations. We attempted to run benchmarks with an increasing number of subscribers, but the channel implementation was unable to handle any measurable increase in subscribers.
The charts below compare the Ringbuffer to the channel implementation with 1,000 subscribers and a variable number of trades.
Even for benchmarks with a varying number of subscribers and a set number of trades, the Ringbuffer continued to outperform the channel implementation, demonstrating significantly lower allocations and a more consistent time per operation. While the Ringbuffer might use slightly more memory for a higher number of subscribers (a difference of about 10 megabytes), the 11x decrease in allocs/op and 398x decrease in ns/op at peak scale more than justify the base memory increase.
The charts below show the performance of the Ringbuffer and channel implementations with a constant 100 thousand trades and an increasing number of subscribers.