

Optimizing Producer-Consumer Architecture for Market Data at Coinbase

TL;DR: Migrating Market Data Services to a low-allocation, in-memory ringbuffer architecture drastically improved latency and performance.

By Chris Diehl, August 4, 2025


At Coinbase, we're constantly striving to optimize our systems for efficiency and performance. Within trading, we're especially interested in optimizing for both latency and scale, since our systems are critical for giving users the latest information they need to make informed trades.

In this post, we'll delve into some of the results from re-architecting our market data systems to reduce latency without sacrificing scalability. While this project is specific in its application, it has provided valuable insights into the broader challenge of optimizing producer-consumer architectures; our goal is to share what we learned and how these optimizations are benefiting Coinbase today.

Understanding Producer-Consumer Architecture

A producer-consumer architecture is fundamental in many data-driven systems. It involves producers reading from a data source, placing data into a queue, and then fanning it out to subscribers who read from the queue. This design decouples producer and consumer logic, enabling parallel processing. 

A simplified diagram of a standard single producer, multiple consumer architecture can be found below:

[Diagram 1: a single producer fanning data out to multiple consumers through a queue]

Our market data systems read data from a single source, perform some standard computations such as grouping the orderbook data by aggregate sizes, and then send the computed data across gRPC streaming connections, which represent end users receiving real-time market data. The challenge lies in efficiently managing the buffer and communication with subscribers while reducing redundant allocations to avoid latency spikes during market volatility.

Initial Exploration: Channels in Golang

Our first approach leveraged Golang's idiomatic channels for implementing the producer-consumer pattern. This setup is attractive because goroutines automatically block and suspend processing when there are no messages in the channel, preventing busy waiting by subscribers.
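
As a rough illustration (not the production code), a channel-based fan-out might look like the sketch below: each subscriber owns its own channel and blocks on receive until the producer sends the next trade. The `Trade` struct and its fields are hypothetical stand-ins for our real market data types.

```go
package main

import "fmt"

// Trade is a hypothetical stand-in for the real market data struct.
type Trade struct {
	ProductID string
	Price     float64
	Size      float64
}

func main() {
	const numSubscribers = 3

	// One channel per subscriber; every send copies the Trade value.
	subs := make([]chan Trade, numSubscribers)
	done := make(chan struct{})
	for i := range subs {
		subs[i] = make(chan Trade, 64)
		go func(id int, ch <-chan Trade) {
			// Receiving parks the goroutine when the channel is empty,
			// so subscribers never busy-wait.
			for t := range ch {
				fmt.Printf("subscriber %d saw %s @ %.2f\n", id, t.ProductID, t.Price)
			}
			done <- struct{}{}
		}(i, subs[i])
	}

	// Producer: fan each trade out to every subscriber.
	for i := 0; i < 5; i++ {
		t := Trade{ProductID: "BTC-USD", Price: 100000 + float64(i), Size: 0.1}
		for _, ch := range subs {
			ch <- t // value copy per subscriber
		}
	}
	for _, ch := range subs {
		close(ch)
	}
	for range subs {
		<-done
	}
}
```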

However, a critical observation was the impact of data copying. Golang copies each message sent to a channel to ensure the consumer can safely modify it. Because the trade struct was 80 bytes, this resulted in drastic spikes of allocated memory and spotty performance during traffic surges.

Given that the consumers never modified the data, the team re-implemented the producer to batch the data and send each batch via a pointer. Golang still copies the pointer for each consumer, but that copy is only pointer-sized (8 bytes on 64-bit platforms) rather than the 80-byte per-consumer copy seen previously.
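
Continuing the hypothetical `Trade` type from the sketch above, the batched variant might look like the following: the producer accumulates trades into a batch and fans out a single pointer, so each channel send copies only the pointer rather than the trades themselves. This relies on consumers treating the shared batch as read-only.

```go
// TradeBatch groups trades so that a single pointer can be fanned out
// instead of copying every trade struct once per subscriber.
type TradeBatch struct {
	Trades []Trade
}

// producer reads trades from a source, batches them, and fans out pointers.
// Only the pointer is copied on each send; the underlying trades are shared
// and must be treated as read-only by consumers.
func producer(src <-chan Trade, subs []chan *TradeBatch, batchSize int) {
	batch := &TradeBatch{Trades: make([]Trade, 0, batchSize)}
	for t := range src {
		batch.Trades = append(batch.Trades, t)
		if len(batch.Trades) == batchSize {
			for _, ch := range subs {
				ch <- batch // pointer copy only
			}
			// Allocate a fresh batch; readers may still hold the old one.
			batch = &TradeBatch{Trades: make([]Trade, 0, batchSize)}
		}
	}
}
```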

The green boxes in the diagram below represent the original implementation, with the blue representing the final evolution of the idiomatic Golang implementation, titled Simple Fan V2.

[Diagram 2: evolution from the original channel implementation (green) to the final idiomatic Golang implementation (blue)]

In general, the idiomatic Golang solution provides easy-to-read logic and scales well enough for the majority of workloads. We were able to reduce the memory overhead of the channel implementation by using pointers and batching the data. However, the team believed we could do even better by focusing on sharing data and optimizing reads.

Exploring Ring Buffers: A Deeper Optimization

Approach

Our pursuit of further optimization led us to a pioneer of the space: the LMAX Disruptor's ring buffer. A ring buffer is a fixed-size circular buffer where the producer writes entries in place and subscribers read the data in the buffer at their own pace. The key benefit here is that subscribers look at the data rather than copying it, avoiding allocations.
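
The post does not include the production code, but a minimal sketch of the core idea, reusing the hypothetical `TradeBatch` type from above, might look like this: a power-of-two sized array, a monotonically increasing producer sequence, and a per-reader sequence, with index wrapping done by a cheap bitmask. Readers access entries in place rather than receiving copies. Synchronization, overwrite protection, and atomic sequence handling are deliberately omitted here; the implementation section below covers how readers block.

```go
// ring is a minimal single-producer ring buffer sketch.
// Capacity must be a power of two so that the slot index can be
// computed with a bitmask (seq & mask) instead of a modulo.
type ring struct {
	buf  []TradeBatch
	mask uint64 // len(buf) - 1
	head uint64 // next sequence the producer will write
}

func newRing(capacity uint64) *ring {
	if capacity == 0 || capacity&(capacity-1) != 0 {
		panic("capacity must be a power of two")
	}
	return &ring{buf: make([]TradeBatch, capacity), mask: capacity - 1}
}

// publish writes the next entry in place and advances the producer sequence.
// (A real implementation would use atomic stores and guard against lapping
// slow readers; both are omitted in this sketch.)
func (r *ring) publish(b TradeBatch) {
	r.buf[r.head&r.mask] = b
	r.head++
}

// reader tracks its own position and reads entries in place, without copying
// them out of the buffer. Callers must treat returned entries as read-only.
type reader struct {
	r    *ring
	next uint64
}

// poll returns the next unread entry, or false if the reader has caught up.
func (rd *reader) poll() (*TradeBatch, bool) {
	if rd.next >= rd.r.head {
		return nil, false
	}
	b := &rd.r.buf[rd.next&rd.r.mask]
	rd.next++
	return b, true
}
```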

[Diagram 3: LMAX-style ring buffer with a single producer and multiple readers]

Ringbuffers in practice 

At Coinbase, modeling our market data producer/consumer architecture after the LMAX ring buffer offers several compelling advantages, primarily focused on reducing latency and increasing throughput. Given that our market data systems are critical for providing users with the latest information for informed trades, the Disruptor's design aligns perfectly with our goals:

  • Reduced Latency: The lock-free nature and efficient handling of concurrent access in the LMAX architecture minimize delays in data propagation. This is crucial for real-time market data where every millisecond counts.

  • High Throughput: By avoiding traditional queue bottlenecks and minimizing contention, the Disruptor can handle a significantly higher volume of market data updates, especially during periods of high volatility.

  • Efficient Memory Usage: The in-place processing of data within the ring buffer, where consumers "look" at data rather than copying it, directly addresses the issues of redundant allocations and garbage collection pauses that we encountered with Go channels. This aligns with our observation that "the fastest operation is the one you don't do, or the one you do only once and reuse."

  • Scalability: The architecture's ability to support multiple consumers efficiently makes it well-suited for fanning out market data to a large number of subscribers without degradation in performance.

By adopting principles from the LMAX Disruptor, we aimed to build an even more robust and performant market data system, delivering crucial information to our users with minimal delay.

Implementation

As our market data systems are built in Golang, we began with a standard ringbuffer implementation, utilizing an underlying array sized to a power of 2. To replicate the blocking behavior of channels, we leveraged Go's `sync.Cond`. Subscribers invoke `Wait` on the condition when there's no data to read, and the producer calls `Broadcast` once a new batch is ready, waking all sleeping routines. While this approach introduces the possibility of a "thundering herd" problem, the system effectively manages the wake-up process due to the absence of locks in the read path.
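
A hedged sketch of that blocking behavior, layered on the `ring` type from the previous sketch: readers that have caught up take the lock only to sleep on the condition variable, and the producer calls `Broadcast` after publishing. The names here are illustrative, not our production API, and atomic sequence handling is again omitted for brevity.

```go
import "sync"

// condRing wraps the ring with a sync.Cond so that idle readers can sleep
// instead of spinning, mirroring the blocking semantics of a channel receive.
type condRing struct {
	ring
	mu   sync.Mutex
	cond *sync.Cond
}

func newCondRing(capacity uint64) *condRing {
	cr := &condRing{ring: *newRing(capacity)}
	cr.cond = sync.NewCond(&cr.mu)
	return cr
}

// publish writes the batch, then wakes every sleeping reader. Waking all
// readers at once is the "thundering herd" mentioned above; it stays cheap
// because readers do not take the lock on their hot read path.
func (cr *condRing) publish(b TradeBatch) {
	cr.mu.Lock()
	cr.ring.publish(b)
	cr.mu.Unlock()
	cr.cond.Broadcast()
}

// waitFor blocks the calling reader until the producer has published the
// sequence it wants to read next. Readers only land here after poll
// reports that they have caught up.
func (cr *condRing) waitFor(next uint64) {
	cr.mu.Lock()
	for next >= cr.head {
		cr.cond.Wait() // releases the lock while asleep, reacquires on wake
	}
	cr.mu.Unlock()
}

// A reader loop then looks roughly like:
//
//	for {
//		if b, ok := rd.poll(); ok {
//			handle(b)
//			continue
//		}
//		cr.waitFor(rd.next)
//	}
```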

Building upon these advancements, we further optimized the ringbuffer by integrating a `sync.Pool` for object reuse. This was particularly advantageous for our market data systems due to their inherent characteristics: high object churn, frequent garbage collection pauses during critical periods, and the prevalence of short-lived objects. By leveraging `sync.Pool`, we mitigated the performance impact of these factors, ensuring smoother operation. Additionally, we streamlined the read path by shifting the responsibility of state tracking to individual readers, further enhancing efficiency.
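
A minimal sketch of the `sync.Pool` piece, again using the hypothetical `TradeBatch` type: the producer rents a batch from the pool and, once every reader has moved past the entry that referenced it, the batch is reset and returned, so short-lived batch objects stop feeding the garbage collector. Tracking when all readers are done is part of the reader-side state mentioned above and is not shown here.

```go
import "sync"

// batchPool recycles TradeBatch objects to reduce the allocation churn and
// GC pressure caused by large numbers of short-lived batches.
var batchPool = sync.Pool{
	New: func() any { return &TradeBatch{Trades: make([]Trade, 0, 256)} },
}

// getBatch rents a batch from the pool and resets it for reuse.
func getBatch() *TradeBatch {
	b := batchPool.Get().(*TradeBatch)
	b.Trades = b.Trades[:0]
	return b
}

// putBatch returns a batch to the pool. It must only be called once no
// reader still holds a reference to the batch.
func putBatch(b *TradeBatch) {
	batchPool.Put(b)
}
```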

Comparative Results and Takeaways

TL;DR: The Ringbuffer implementation saw a 38x decrease in average execution time when compared to the original channel implementation.

We ran a myriad of benchmarks to determine the performance profiles of the optimized ring buffer approach (RingbufferV6) and the optimized channel implementation (SimpleFanV3). To start, we set the number of subscribers to a constant, then increased the number of trades. We then set the number of trades to a constant and increased the number of subscribers. The goal was to understand how the two approaches performed under different constraints. As trade counts increase, we would expect to see higher allocations and more GC thrashing in the channel implementation, while more subscribers would introduce more scheduling and communication overhead on the system.
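
The exact benchmark harness is not shown in this post, but the general shape, using Go sub-benchmarks over the two axes (trade count and subscriber count), might look like the sketch below. `runFanOut` is a hypothetical helper standing in for whichever implementation is under test.

```go
package marketdata

import (
	"fmt"
	"testing"
)

// runFanOut is a hypothetical helper that pushes `trades` messages through
// the named implementation to `subs` subscribers and waits for completion.
func runFanOut(impl string, trades, subs int) { /* ... */ }

func BenchmarkFanOut(b *testing.B) {
	for _, impl := range []string{"SimpleFanV3", "RingbufferV6"} {
		// Axis 1: fixed subscriber count, increasing trade volume.
		for _, trades := range []int{100_000, 1_000_000, 10_000_000} {
			b.Run(fmt.Sprintf("%s/subs=1000/trades=%d", impl, trades), func(b *testing.B) {
				b.ReportAllocs()
				for i := 0; i < b.N; i++ {
					runFanOut(impl, trades, 1_000)
				}
			})
		}
		// Axis 2: fixed trade volume, increasing subscriber count.
		for _, subs := range []int{100, 1_000, 10_000} {
			b.Run(fmt.Sprintf("%s/trades=100000/subs=%d", impl, subs), func(b *testing.B) {
				b.ReportAllocs()
				for i := 0; i < b.N; i++ {
					runFanOut(impl, 100_000, subs)
				}
			})
		}
	}
}
```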

The benchmarks revealed a striking difference between the two implementations. At the peak trade volume (10 million), the Ringbuffer saw a 10x decrease in allocations (allocs/op) and a 24x decrease in average execution time (ns/op) compared to the channel implementation. At the max subscriber benchmark (10 thousand), the Ringbuffer similarly outperformed the channel implementation with a 398x decrease in average execution time and a 10x decrease in allocations. We attempted to run benchmarks with even more subscribers, but the channel implementation was unable to handle any measurable increase beyond that point.

The charts below compare the Ringbuffer to the channel implementation with 1,000 subscribers and a variable number of trades.

[Charts: CPU time, allocation count, and memory use scaling with trade count]

Even for benchmarks with a varying number of subscribers and a set number of trades, the Ringbuffer continued to outperform the channel implementation, demonstrating significantly lower allocations and a more constant time per operation. While the Ringbuffer might use slightly more memory for a higher number of subscribers (a difference of about 10 megabytes), the 11x decrease in allocs/op and 398x decrease in ns/op at peak scale more than justify the base memory increase.

The charts below show the performance of the ringbuffer and channel implementations with a constant 100 thousand trades and an increasing number of subscribers.

[Charts: CPU time, allocation count, and memory use scaling with subscriber count]

Key Takeaways

  1. Benchmarking Pays Off: You don't know what you don't know until you measure it. Consistent and thorough benchmarking is crucial for identifying bottlenecks and validating optimizations.

  2. Do Less, Do Once: The fastest operation is the one you don't do, or the one you do only once and reuse. Reducing unnecessary allocations and operations is a powerful optimization strategy.

  3. Performance vs. Readability: Optimizing for raw performance often comes at the cost of readability. While the final Ringbuffer implementation is highly performant, it incorporates "niche" Go technologies like bitmasks and `sync.Pool` that require a deeper understanding. It's important to weigh these tradeoffs based on the specific needs and maintenance overhead of the system.

  4. Embrace the Process: Benchmarking and optimization can be a fun and rewarding process. Don't be afraid to experiment and explore different approaches, but focus your efforts via benchmarking tooling to ensure you're not prematurely optimizing.

The learnings from this project are not just theoretical; they directly contribute to the efficiency and responsiveness of our market trading systems at Coinbase. By re-architecting our market data systems to thoughtfully handle latency at scale using in-memory Ringbuffers, the systems are able to handle significantly more load while providing a lower-latency user experience, as evidenced by the analysis above.

While there is a slight decrease in readability and corresponding increase in complexity, the choices made help provide a significantly more scalable and responsive trading experience for Coinbase users.

If these problems interest you - come work with us at Coinbase!

Check out some of our open positions: coinbase.com/careers
