Resolution
We took immediate steps to mitigate the impact of the incident as much as technically possible. Coinbase engineers were paged at 3:08AM ET, and all service owners worked to reroute traffic so that their services could be used by customers. This work included targeted mitigations from our standard recovery procedures, such as disabling auto cluster consolidations, locking deploys, load shedding non-critical services, and manually assigning our fixed capacity pool to the remaining services.
All Coinbase services were restored at 6:45PM ET after AWS implemented fixes to multiple systems that Coinbase relies on, in particular EC2’s network state propagation.
The work done by engineers in the first wave prevented significant impact during the second wave of this widespread outage. To be better prepared in the future, we are exploring all options, including reviewing our regional deployment strategy, to implement immediate and long-term fixes that reduce the impact of these types of outages.