Building enterprise AI agents at Coinbase: engineering for trust, scale, and repeatability

By Jason Dodds

Engineering, December 22, 2025

Over a six-week period, the Enterprise Applications and Architecture team at Coinbase formed an Agentic AI Tiger Team tasked with paving roads for building and hosting AI agents, developing best practices and design patterns, and setting a blueprint for other teams to follow. We focused on building process automation agents in the Institutional support, Onramp onboarding, and Listing legal review spaces to get diversity in both use cases and team execution.

Simply stated, enterprise AI agents are just software services — but with one big twist: they need the same rigor we apply to any production system, plus interpretability and auditability for human and regulatory trust. That framing drove how we built enterprise AI agents, how we standardized their development, and why we ultimately leaned into a code-first approach for the automations we wanted to run at scale.

What makes enterprise agents different

On many consumer product surfaces you can move fast, iterate in the UI, and accept a bit of drift. Inside a company, agents interact with business data and can automate portions of previously human-only workflows. That means they must be hosted in our infrastructure (which is geared toward Golang service hosting), versioned through our pipelines, observable end-to-end, evaluated in a repeatable way, and auditable down to inputs and decision traces to ensure safe and secure use. The outcome is an agent graph where LLM calls are one node in a larger, testable, monitored system — by design.

Why we chose high‑code over low‑code for core automations

We experimented with both patterns. Low-code tools are great for discovery and rapid prototyping; you can wire up tools quickly and learn fast. But the more tools and instructions you load into a prompt, the more “context noise” you introduce, making outputs harder to reproduce and individual steps harder to unit test or gate in CI. That’s fine for exploration, but less ideal for long-running, operational flows.

For the automations we intended to scale, high‑code won out. Code‑first graphs (e.g., LangGraph/LangChain patterns) gave us typed interfaces, version control, clean separation of “data” nodes from “LLM” nodes, and the ability to attach observability, evaluation, and human‑in‑the‑loop controls as first‑class concerns. In short, we could engineer agents like services, not like chats.
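
To make that split concrete, here is a minimal sketch of the shape these graphs take, using LangGraph's StateGraph with one deterministic data node and one LLM node. The state fields, the fetch_from_support_system stub, and the model choice are illustrative stand-ins, not our production code.

```python
from typing import TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, END

# Illustrative model choice; any LangChain-compatible chat model would slot in here.
llm = ChatOpenAI(model="gpt-4o-mini")


def fetch_from_support_system(ticket_id: str) -> dict:
    # Stand-in for a real internal support-system client (hypothetical).
    return {"id": ticket_id, "subject": "Cannot complete KYC upload", "body": "..."}


class TicketState(TypedDict):
    ticket_id: str
    raw_ticket: dict
    summary: str


def fetch_ticket(state: TicketState) -> dict:
    # Deterministic "data" node: plain I/O and transforms, covered by unit tests.
    return {"raw_ticket": fetch_from_support_system(state["ticket_id"])}


def summarize_ticket(state: TicketState) -> dict:
    # Probabilistic "LLM" node: covered by evaluation datasets, not unit tests.
    result = llm.invoke(
        f"Summarize this support ticket for triage:\n{state['raw_ticket']}"
    )
    return {"summary": result.content}


graph = StateGraph(TicketState)
graph.add_node("fetch_ticket", fetch_ticket)
graph.add_node("summarize_ticket", summarize_ticket)
graph.add_edge(START, "fetch_ticket")
graph.add_edge("fetch_ticket", "summarize_ticket")
graph.add_edge("summarize_ticket", END)
agent = graph.compile()
```

Because fetch_ticket is pure data plumbing, it gets ordinary unit tests; summarize_ticket gets an evaluation dataset and a trace instead.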

A six‑week push to pave the roads

We formed a focused team with a simple mandate: ship a handful of meaningful automations, and in doing so, standardize the stack — hosting, observability, testing, human review, and audits — so any team could build an agent the same way. We prioritized categories of work that were manual, time‑consuming, and decision‑heavy: think “summarize and triage,” “collect and compare,” and “draft with references for a human to approve,” across a few different internal domains. 

By the end of the six weeks, two automations were in production with measurable time savings, two more were completed end‑to‑end in development, and we’d published reference implementations and onboarding materials other teams could adopt. But most importantly, once those paved roads were in place, the time to build a new agent dropped from quarters to days — and more than half a dozen engineers were able to self-serve on the patterns.

We built this with an “observability‑first” mindset. Every tool call, retrieval, decision, and output is traced. Data‑fetch and transform steps are deterministic and unit‑tested. LLM steps run with evaluation harnesses and curated datasets. We use a second LLM as a judge for spot‑checks and confidence scoring. And we treat human review as an intentional part of the system, not a workaround.
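
As a rough illustration of the judge pattern (not our actual harness), a second model grades an output against a curated reference and returns a score we can use for spot-checks and confidence routing. The prompt, threshold, and model choice below are assumptions.

```python
import json

from langchain_openai import ChatOpenAI

# Illustrative judge model; temperature 0 keeps grading as consistent as possible.
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

JUDGE_PROMPT = """You are grading an AI agent's output.
Task: {task}
Reference answer: {reference}
Agent answer: {answer}
Respond with JSON only: {{"score": <number between 0 and 1>, "reason": "<one sentence>"}}"""


def judge_output(task: str, reference: str, answer: str) -> dict:
    """Ask the judge model for a score and a one-line justification."""
    response = judge.invoke(
        JUDGE_PROMPT.format(task=task, reference=reference, answer=answer)
    )
    return json.loads(response.content)


# Spot-check a single run against one curated dataset row.
verdict = judge_output(
    task="Summarize the onboarding blockers in this ticket.",
    reference="Customer is blocked on KYC document upload.",
    answer="The customer cannot finish onboarding because their KYC documents fail to upload.",
)
if verdict["score"] < 0.7:  # illustrative threshold
    print("Low confidence, route this run to human review:", verdict["reason"])
```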

We designed for auditability from the start, which is essential in our legal and regulatory flows and best practice for everything else. Each agent execution produces an immutable record showing which data was used, how it was used, the reasoning the agent followed, and who approved the output. Once we built these capabilities for one agent, we could reuse them across all of them. That lets us meet today’s requirements with the least human friction — while building confidence to reduce that friction over time.
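
For a sense of what such a record captures, here is a minimal sketch of an audit record in plain Python; the field names and hashing choice are illustrative, not our internal schema.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentAuditRecord:
    """Immutable record of one agent execution (illustrative shape, not our schema)."""

    run_id: str
    agent_name: str
    agent_version: str
    inputs: dict          # the exact inputs the run started from
    sources_used: list    # documents and systems the agent read
    decision_trace: list  # ordered node-by-node reasoning and tool calls
    output: str
    approved_by: str | None  # the human reviewer, when the flow requires approval
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        # Content hash so downstream systems can detect tampering after the fact.
        payload = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()
```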

What we standardized (so teams don’t have to rediscover it)

The playbook we landed on is simple to explain, yet opinionated enough to scale.

  • Build the “job description” before the agent. We write the agent’s SOP first — what “good” looks like, what sources it can use, where it must defer to a human. If a new hire couldn’t succeed with that SOP, an agent won’t either.

  • Engineer the graph, not the chat. Use code‑first graphs to separate deterministic data nodes (unit‑tested) from probabilistic LLM nodes (evaluated). That split keeps failures diagnosable and runs reproducible.

  • Treat observability as a requirement. Trace everything, diff runs across versions, and store artifacts with the run. You can’t tune what you can’t see.

  • Keep a human in the loop — intentionally. Design the handoff and feedback loop into the UX. Long-tail tuning is real, and it’s faster when feedback is captured where work happens (see the sketch after this list).

  • Design for auditability on day one. Reference every claim to a source, and tie the output to the exact inputs, tools, and reasoning path. It’s the shortest path to both compliance and trust.

  • Prefer the simplest viable runtime. In practice, Python‑only builds let us move quickly; we only add complexity (e.g., sidecars) when the use case demands it.
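
To show what an intentional handoff can look like inside a code-first graph (referenced in the human-in-the-loop item above), here is a sketch of confidence-based routing to a review node. The stubbed confidence value, the 0.9 threshold, and the request_human_approval hook are hypothetical.

```python
from typing import Literal, TypedDict

from langgraph.graph import StateGraph, START, END


class ReviewState(TypedDict):
    draft: str
    confidence: float
    final_output: str


def draft_with_llm(state: ReviewState) -> dict:
    # Probabilistic step (stubbed here): produce a draft plus a confidence score.
    return {"draft": "Draft response with cited sources...", "confidence": 0.62}


def request_human_approval(draft: str) -> str:
    # Hypothetical hook into the reviewer's existing tool, so feedback is
    # captured where the work already happens.
    return draft  # pretend the reviewer approved it unchanged


def human_review(state: ReviewState) -> dict:
    return {"final_output": request_human_approval(state["draft"])}


def auto_finalize(state: ReviewState) -> dict:
    return {"final_output": state["draft"]}


def route(state: ReviewState) -> Literal["human_review", "auto_finalize"]:
    # Low-confidence drafts always go to a person; the threshold can tighten or
    # relax as evaluation data builds trust over time.
    return "auto_finalize" if state["confidence"] >= 0.9 else "human_review"


graph = StateGraph(ReviewState)
graph.add_node("draft_with_llm", draft_with_llm)
graph.add_node("human_review", human_review)
graph.add_node("auto_finalize", auto_finalize)
graph.add_edge(START, "draft_with_llm")
graph.add_conditional_edges("draft_with_llm", route)
graph.add_edge("human_review", END)
graph.add_edge("auto_finalize", END)
app = graph.compile()
```

The threshold is just a dial: as evaluation data accumulates, it can move without touching the rest of the graph.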

How this complements other AI investments at Coinbase

Our work sits alongside Coinbase’s broader AI efforts rather than overlapping with them. You may have read about our multi‑agent decision support that augments internal decision documents with explainable, auditable analysis, and our testing agents that boost product quality by autonomously executing scenarios and self‑evaluating findings. Those are excellent examples of AI in decisioning and quality engineering. The paved roads here focus on a different need: reliable, observable internal automations that interact with business systems and free teams from recurring, human‑intensive work.

Results

In six weeks, we proved the pattern on multiple internal automations, put two into production with more than 25 hours saved per week, completed two more end-to-end in development, and published the paved roads that cut new build time from 12+ weeks to under a week while upskilling engineers to self-serve. Our investment in building out a hosted LangSmith implementation was adopted by the AI team at Coinbase, so we now have a consistent, company-approved observability platform for all our agents. That’s the engine we cared about building: something repeatable that raises the floor for everyone.

[EAA agent diagram: example architecture for the agent platform]

Where do we go from here?

Going forward, we plan to refine our infrastructure and design patterns, leaning on Python-only agents hosted in AWS Bedrock AgentCore. We will advance our use of tracing, evaluation, and logging with LangGraph. We also have roadmap items to build internal tooling that combines content management platforms and document repositories with external data into a knowledge graph supporting a variety of use cases.

Closing thought

The big lesson for us was not “LLMs can do X,” but that agents are a software discipline. When we treat them that way — hosted properly, observable end‑to‑end, testable where deterministic and evaluated where probabilistic, with human oversight and auditability — we get the best of both worlds: speed where it helps, and rigor where it matters. That’s how we move fast and still meet Coinbase’s bar for trust and scale.
