
How to Move Fast with Confidence

By Zhiwei Li, May 28, 2025

Coinbase

“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.” – Mark Twain

Moving fast is easy, and so is avoiding mistakes. But achieving both together? That’s the real challenge. Coinbase’s mission to increase economic freedom in the world hinges on mastering both in tandem.

Purpose Engineering

Have you noticed that no matter how many engineers you add, or how fast your CPUs, GPUs, and RAM are, it can still be hard to pick up speed without breaking things?

It turns out that what really slows things down isn’t always the complexity of the systems themselves. It’s the unpredictability in how all their parts interact that keeps you from moving fast with confidence. When we layer on more resources without addressing this unpredictability, we risk creating more chaos than clarity.

At Coinbase, we introduced "purpose engineering" as a forward-thinking practice designed to tackle unpredictability head-on. This approach leverages O2, our real-time anomaly detection platform, to (1) codify expectations for our systems and (2) receive instant alerts if anything violates these expectations.

This clarity and instant feedback allow our teams to move quickly and confidently, so we can chase new opportunities without the fear of the unknown unknowns.

Our Approach

Every system consists of three key components: elements, interconnections, and a purpose. We typically model elements as data structures and interconnections as RPCs (Remote Procedure Calls) or messaging. However, we lacked a tool to reason about a system's purpose.

A system's purpose is rarely stated explicitly; instead, it tends to emerge through operation, typically in the form of incidents. That’s why an epiphany can arrive 10 minutes after an incident when months of development never produced it. We don’t suddenly become smarter; the incident forces us to think about the system’s purpose, and the underlying logic becomes evident.

O2 is designed to fill this gap, equipping us with a way to clarify the system's purpose ahead of any incident, and notify us when it deviates from expected behavior.

Modeling Purpose with Causality Invariants

To achieve this, we define purpose in terms of causal relationships: 

(Cause) Event A → (Effect) Event B (at time t)

In this model, each event triggers another, allowing us to articulate not just isolated actions, but also the intended system behavior and the interactions between multiple systems over time. This causal structure makes it possible to predict how the system should respond under various conditions, helping us identify when it deviates from the intended outcome. 
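One way to state such an invariant precisely, in notation of our own choosing rather than O2's: let A be the set of cause events, B the set of effect events, key(·) the identifier that links an effect to its cause (for example a transfer ID), τ(·) an event's timestamp, and Δ the allowed delay t. The invariant holds when every cause has a matching effect inside its window:

$$
\forall a \in A,\ \exists b \in B:\quad \mathrm{key}(b) = \mathrm{key}(a) \;\wedge\; \tau(a) \le \tau(b) \le \tau(a) + \Delta
$$

A violation is any cause a for which no such b exists once event time has moved past τ(a) + Δ; before that point the cause is simply pending.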

Example: Sending BTC on time

(Cause) Send BTC → (Effect) BTC transferred within 2 hours

The following policy in our in-house DSL (Domain-Specific Language) captures the expectation that a BTC P2P send, once initiated, should be processed within 2 hours.

[Image: O2 DSL policy for sending BTC on time]
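As a rough illustration of what such a policy evaluates, here is a Python sketch that applies the 2-hour rule to a batch of events. The event shape, the `send_initiated`/`send_completed` kinds, `transfer_id`, and `late_or_missing_sends` are hypothetical names, not O2's schema or DSL. Judging the batch at an explicit event time `as_of` keeps sends whose window is still open from being reported.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical event shape; O2's actual schema is not described in the post.
@dataclass(frozen=True)
class Event:
    kind: str            # "send_initiated" (cause) or "send_completed" (effect)
    transfer_id: str     # key linking a cause to its effect
    event_time: datetime

DEADLINE = timedelta(hours=2)

def late_or_missing_sends(events, as_of):
    """Return IDs of sends not completed within DEADLINE, judged at event time
    `as_of`. Sends whose window is still open are pending, not violations."""
    initiated, completed = {}, {}
    for e in events:
        if e.kind == "send_initiated":
            initiated[e.transfer_id] = e.event_time
        elif e.kind == "send_completed":
            completed[e.transfer_id] = e.event_time

    violations = []
    for tid, start in sorted(initiated.items()):
        deadline = start + DEADLINE
        done = completed.get(tid)
        if done is not None and start <= done <= deadline:
            continue  # effect observed inside the window: invariant holds
        if done is None and as_of < deadline:
            continue  # no effect yet, but the window is still open
        violations.append(tid)  # effect outside the window, or missing past the deadline
    return violations

t0 = datetime(2025, 1, 1, 12, 0)
events = [
    Event("send_initiated", "a", t0),
    Event("send_completed", "a", t0 + timedelta(minutes=30)),  # on time
    Event("send_initiated", "b", t0),
    Event("send_completed", "b", t0 + timedelta(hours=3)),     # late
    Event("send_initiated", "c", t0),                          # never completed
    Event("send_initiated", "d", t0 + timedelta(hours=2)),     # still pending
]
print(late_or_missing_sends(events, as_of=t0 + timedelta(hours=2, minutes=30)))
# → ['b', 'c']
```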

Example: Selling Crypto

When a user sells their crypto (e.g., BTC), they should receive the correct amount of fiat currency: the BTC price multiplied by the amount sold, less any commission fees.

(Cause) Sell Crypto → (Effect) Credit user with (Price × Amount) - Commission Fees

[Image: O2 DSL policy for selling crypto]
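The arithmetic behind this invariant can be sketched in Python with `decimal.Decimal`, since binary floats cannot represent many decimal currency amounts exactly. The function names are hypothetical, and rounding the result to cents with half-up rounding is an assumption for illustration; the post does not state O2's rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def expected_credit(price: Decimal, amount: Decimal, commission: Decimal) -> Decimal:
    """(Price x Amount) - Commission Fees, rounded to cents.
    The half-up rounding to 2 decimal places is an assumed policy."""
    gross = price * amount
    return (gross - commission).quantize(CENT, rounding=ROUND_HALF_UP)

def credit_is_correct(price: Decimal, amount: Decimal,
                      commission: Decimal, credited: Decimal) -> bool:
    # The invariant: what the user was credited equals what they should get.
    return expected_credit(price, amount, commission) == credited

# Selling 0.015 BTC at 65,000.00 with a 5.85 fee should credit 969.15.
print(credit_is_correct(Decimal("65000.00"), Decimal("0.015"),
                        Decimal("5.85"), Decimal("969.15")))
# → True
```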

Architecture

[Figure: O2 architecture diagram]

The O2 platform is built on an event-sourcing architecture and functions as an ETC ("Extract, Transform, and Check") data pipeline. Events are captured in real time and serve as the single source of truth. Events from all sources are then transformed into a universal format and checked, and any detected inconsistencies are reported. Because the ETC components are decoupled, each part can be iterated on independently without disrupting the overall system. This flexibility, combined with the event-sourcing foundation, lets the platform evolve while onboarding new use cases seamlessly.
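The decoupling of the three ETC stages can be sketched in Python as plain functions composed by a small driver. Everything here is hypothetical: the `UniversalEvent` fields, the two source schemas, and the `non_negative_credit` check are made up for illustration. The point is that each source brings its own extractor and transformer, while checks see only the universal format, so any one piece can change without touching the others.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Iterable, Iterator

@dataclass(frozen=True)
class UniversalEvent:
    # Hypothetical universal format shared by every source.
    name: str
    key: str
    event_time: int  # epoch milliseconds
    amount: Decimal

@dataclass(frozen=True)
class Violation:
    invariant: str
    key: str

Extract = Callable[[], Iterable[dict]]
Transform = Callable[[dict], UniversalEvent]
Check = Callable[[UniversalEvent], Iterable[Violation]]

def run_etc(sources: list[tuple[Extract, Transform]],
            checks: list[Check]) -> Iterator[Violation]:
    """Extract raw records per source, transform them, then run every check."""
    for extract, transform in sources:
        for raw in extract():
            event = transform(raw)
            for check in checks:
                yield from check(event)

# Two hypothetical sources with different raw schemas.
def ledger_rows():
    return [{"type": "credit", "id": "L1", "ts_ms": 1_000, "amt": "10.00"},
            {"type": "credit", "id": "L2", "ts_ms": 2_000, "amt": "-3.00"}]

def from_ledger(row):
    return UniversalEvent(row["type"], row["id"], row["ts_ms"], Decimal(row["amt"]))

def exchange_msgs():
    return [{"event": "credit", "order_id": "X1", "ts_s": 5, "value": "7.50"}]

def from_exchange(msg):
    return UniversalEvent(msg["event"], msg["order_id"], msg["ts_s"] * 1000,
                          Decimal(msg["value"]))

# A check written only against the universal format.
def non_negative_credit(e):
    if e.name == "credit" and e.amount < 0:
        yield Violation("non_negative_credit", e.key)

violations = list(run_etc([(ledger_rows, from_ledger), (exchange_msgs, from_exchange)],
                          [non_negative_credit]))
print(violations)
# → [Violation(invariant='non_negative_credit', key='L2')]
```

Onboarding a new source in this sketch means adding one extractor and one transformer; existing checks are untouched.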

Design Principles

Occam's Razor

We apply Occam's Razor, the idea that the simplest solution is often the best, to keep (causality) invariants as straightforward as possible. Simple invariants are easier to understand, maintain, and verify, and leave less room for errors. This simplicity not only enhances clarity but also keeps the invariants stable and consistent as the system evolves.

To achieve this, we decouple data processing and implementation from business logic. By separating the concerns of data handling, implementation, and migrations, we ensure that changes in these areas don’t impact the core invariants. In the end, Occam's Razor helps us stay focused on what truly matters, cutting out unnecessary complexity that can get in the way.

Logic-on-Write

We embrace the "logic-on-write" principle to keep our data handling simple and efficient. By applying transformations and computations early in the ingestion process, we eliminate the need for heavy post-processing and avoid storing unnecessary raw data. Every piece of information we capture is tied directly to an invariant check, so nothing is wasted. This approach also lets us pre-join and enrich data as it arrives, so each stored record already carries what its check needs. The result is a streamlined pipeline that stays focused on verification.
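A minimal Python sketch of logic-on-write, assuming a hypothetical sell event whose commission comes from a fee-rate table (`FEE_RATES`, the tier names, and the rates are made up). At ingestion the fee rate is joined, the expected credit is computed, and only a small record is kept; fields the invariant does not need, such as `device`, are dropped. The later check is then a lookup and a comparison, and the record is evicted once checked.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

# Hypothetical reference data joined at write time: commission rate per fee tier.
FEE_RATES = {"standard": Decimal("0.006"), "advanced": Decimal("0.004")}

@dataclass(frozen=True)
class SellCheckRecord:
    # Only what the invariant needs; the raw payload is not retained.
    order_id: str
    expected_credit: Decimal

def ingest_sell(raw: dict, store: dict) -> None:
    """Pre-join the fee rate and compute the expected credit while ingesting."""
    gross = Decimal(raw["price"]) * Decimal(raw["amount"])
    fee = gross * FEE_RATES[raw["tier"]]
    expected = (gross - fee).quantize(CENT, rounding=ROUND_HALF_UP)
    store[raw["order_id"]] = SellCheckRecord(raw["order_id"], expected)

def check_credit(store: dict, order_id: str, credited: Decimal) -> bool:
    # Pop the record: once checked, it no longer needs to be kept.
    record = store.pop(order_id)
    return record.expected_credit == credited

store = {}
ingest_sell({"order_id": "o1", "price": "65000.00", "amount": "0.015",
             "tier": "standard", "device": "ios", "app_version": "9.1"}, store)
print(store["o1"])
# → SellCheckRecord(order_id='o1', expected_credit=Decimal('969.15'))
print(check_credit(store, "o1", Decimal("969.15")))
# → True
```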

Logical Clocks

We use the "logical clocks" design principle to process events based on their intrinsic event time, rather than the system's wall clock. This ensures that variations in processing time, caused by delays or failures, do not affect the logical order of events. By processing events based on their event time, we ensure the output is a direct function of the input data, resulting in a deterministic outcome. This approach also makes it easier to reason about historical data. 

To handle late and out-of-order data, we use a watermark strategy that builds on logical clocks. A watermark marks a point in event time up to which no more events are expected, so anything arriving behind it is treated as late. This lets us process data within a bounded time window, keeping the system efficient and the event sequence consistent and deterministic.
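The two ideas above, ordering by event time and bounding disorder with a watermark, can be sketched in Python as a small reorder buffer. This uses a bounded-out-of-orderness watermark (max event time seen minus an allowed lateness), a common strategy in stream processors such as Apache Flink; we are assuming it for illustration, and `WatermarkBuffer` and `allowed_lateness` are hypothetical names, not O2's. Events are released in (event time, key) order once the watermark passes them, so any arrival order whose disorder stays within the lateness bound yields the same output.

```python
import heapq

class WatermarkBuffer:
    """Release events in event-time order once the watermark passes them.

    watermark = max event time seen - allowed_lateness. Events at or behind
    the watermark on arrival are set aside as late instead of reordered.
    """

    def __init__(self, allowed_lateness: int):
        self.allowed_lateness = allowed_lateness
        self.max_seen = None
        self.heap = []  # (event_time, key); key breaks ties deterministically
        self.late = []

    @property
    def watermark(self):
        if self.max_seen is None:
            return None
        return self.max_seen - self.allowed_lateness

    def add(self, event_time: int, key: str) -> list:
        wm = self.watermark
        if wm is not None and event_time <= wm:
            self.late.append((event_time, key))
            return []
        heapq.heappush(self.heap, (event_time, key))
        if self.max_seen is None or event_time > self.max_seen:
            self.max_seen = event_time
        return self._release()

    def _release(self) -> list:
        out, wm = [], self.watermark
        while self.heap and self.heap[0][0] <= wm:
            out.append(heapq.heappop(self.heap))
        return out

    def flush(self) -> list:
        # End of input: everything still buffered is emitted in order.
        out = sorted(self.heap)
        self.heap = []
        return out

def run(arrivals, allowed_lateness):
    buf = WatermarkBuffer(allowed_lateness)
    out = []
    for t, k in arrivals:
        out.extend(buf.add(t, k))
    out.extend(buf.flush())
    return out, buf.late

a1 = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d"), (6, "f")]
a2 = [(2, "b"), (1, "a"), (4, "d"), (3, "c"), (6, "f"), (5, "e")]
print(run(a1, 2)[0] == run(a2, 2)[0])
# → True
```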

In-Memory

With logic-on-write and logical clocks, we minimize data size and limit how long each piece of data lives, which makes it possible to keep everything in memory. This speeds up stream processing and enables real-time invariant checks. Because persistent storage no longer dictates performance, operations stay fast.
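Why bounded lifecycles make in-memory state feasible can be sketched in Python for a single "cause must be followed by its effect within a window" invariant. `PendingCauses` and its methods are hypothetical, and we assume events are fed in event-time order with a watermark meaning "all events at or before this event time have been processed". An entry exists only between its cause and either its effect or its deadline passing the watermark, so memory is bounded by the causes still in flight.

```python
class PendingCauses:
    """In-memory state for one causality invariant: each cause must be
    followed by its effect within `window` units of event time."""

    def __init__(self, window: int):
        self.window = window
        self.deadlines: dict[str, int] = {}  # cause key -> deadline in event time

    def on_cause(self, key: str, event_time: int) -> None:
        self.deadlines[key] = event_time + self.window

    def on_effect(self, key: str, event_time: int) -> bool:
        """Drop the pending cause; return True only if the effect met its deadline."""
        deadline = self.deadlines.pop(key, None)
        return deadline is not None and event_time <= deadline

    def advance(self, watermark: int) -> list:
        """Evict and return causes whose deadline is at or behind the watermark:
        their effect can no longer arrive in time."""
        expired = sorted(k for k, d in self.deadlines.items() if d <= watermark)
        for k in expired:
            del self.deadlines[k]
        return expired

p = PendingCauses(window=7200)  # e.g. 2 hours, in seconds
p.on_cause("a", 0)
p.on_cause("b", 100)
p.on_effect("a", 1800)          # on time; "a" leaves memory
print(p.advance(7300), len(p.deadlines))
# → ['b'] 0
```

`advance` scans every pending entry for simplicity; a heap ordered by deadline would avoid the scan when many causes are in flight.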

Conclusion

Today, O2 handles over 1 billion events daily and powers more than 50 diverse use cases across Coinbase. In the past 6 months, it has detected unintended results in 10+ scenarios, avoiding millions of dollars in potential financial impact. 

To move fast with confidence is to remember the quote from the beginning and meticulously call out the causality invariants that define our success. Purpose engineering (through O2) provides the framework to do exactly that, equipping us to act quickly and decisively without fear of overlooked risks.

Acknowledgements

This work would not have been possible without the support and contributions of many. Thanks to the members of the Financial Hub, Finance, Analytics, International, Exchange, and Data Platform teams for their dedication, contributions, and unwavering commitment to excellence. The collective effort and support from across teams were essential to the success of this work.
