Cloudflare just published 30 days of production data on its AI code review system: 131,246 review runs across 48,095 merge requests, at a median cost of $0.98 and a median completion time of 3 minutes 39 seconds. Not a beta. Not a proof-of-concept. Full production, across all 5,169 Cloudflare repositories. The numbers are hard to argue with.
The Problem It’s Solving
AI coding tools made a specific problem dramatically worse. Teams using AI assistants are generating 98% more pull requests while their review time has climbed 91%, according to LinearB’s 2026 analysis of 8.1 million PRs. AI-generated PRs wait 4.6x longer before a reviewer even picks them up. The tools that were supposed to speed teams up created a new, very human bottleneck. As Cloudflare’s engineering team put it: “Code review is a fantastic mechanism for catching bugs and sharing knowledge, but it is also one of the most reliable ways to bottleneck an engineering team.”
The Architecture: Specialists Beat Generalists
The core insight is not that Cloudflare used AI for code review — it’s how they structured it. Rather than pointing one model at a diff with a generic prompt, they run up to seven specialized agents per merge request:
- Security — flags only exploitable or concretely dangerous issues; ignores theoretical risks
- Code Quality — logic errors and best practices
- Performance — efficiency concerns
- Documentation — completeness and clarity
- Release Management — deployment readiness
- Compliance — adherence to Cloudflare’s internal Engineering Codex
- AGENTS.md — whether the repo’s AI instruction file needs updating
A coordinator agent, running on Claude Opus 4.7 or GPT-5.4, reads the outputs of whichever specialists ran, deduplicates overlapping findings, re-categorizes issues, filters out speculative noise, and posts a single structured review comment. The coordinator is the only component running frontier-tier models; heavy-lifting sub-reviewers run on Claude Sonnet 4.6 or GPT-5.3 Codex, and text-heavy agents like Documentation run on Kimi K2.5 to keep costs down.
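To make the coordinator's job concrete, here is a minimal sketch of the deduplicate-and-filter step. It is not Cloudflare's implementation: their coordinator is an LLM, while this is a deterministic stand-in, and the `Finding` fields, the severity names, and the `speculative` flag are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical severity ordering; the post does not publish its schema.
SEVERITY_RANK = {"critical": 3, "warning": 2, "suggestion": 1}


@dataclass(frozen=True)
class Finding:
    agent: str                 # which specialist reported it, e.g. "security"
    path: str                  # file the finding refers to
    line: int                  # line number within that file
    severity: str              # one of SEVERITY_RANK's keys
    message: str
    speculative: bool = False  # assumed flag for "this might be a problem"


def coordinate(findings: list[Finding]) -> list[Finding]:
    """Drop speculative findings, then keep one finding per (path, line)."""
    best: dict[tuple[str, int], Finding] = {}
    for f in findings:
        if f.speculative:
            continue  # filter out speculative noise
        key = (f.path, f.line)
        current = best.get(key)
        # When two agents flag the same location, keep the more severe one.
        if current is None or SEVERITY_RANK[f.severity] > SEVERITY_RANK[current.severity]:
            best[key] = f
    # Most severe first, then ordered by location.
    return sorted(
        best.values(),
        key=lambda f: (-SEVERITY_RANK[f.severity], f.path, f.line),
    )
```

Keying on file and line is the crudest possible notion of "overlapping"; a model-based coordinator can also merge findings that describe the same issue at different locations, which this sketch cannot.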
This specialization matters more than it sounds. A single model with a generic “review this code” prompt is essentially being asked to be a security expert, a documentation auditor, and a compliance checker simultaneously. Specialization produces fewer but higher-quality findings — 1.2 per review on average, with the security reviewer producing the highest critical-issue rate at 4%.
The Economics
The system is also risk-tiered, which is what makes the unit economics work:
| Tier | Lines Changed | Agents | Median Cost |
|---|---|---|---|
| Trivial | ≤10 lines | 2 | $0.20 |
| Lite | ≤100 lines | 4 | $0.67 |
| Full | >100 lines or security-sensitive | 7+ | $1.68 |
You do not send Claude Opus to review a README typo fix. Security-sensitive files — anything touching auth/ or crypto/ directories — always trigger full review regardless of diff size.
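The tiering rule in the table can be sketched in a few lines. The thresholds and the `auth/` and `crypto/` directories come from the post; the function names and the path-matching logic are assumptions, and the real system presumably has a richer definition of "security-sensitive".

```python
from pathlib import PurePosixPath

# The two directories named in the post; any real list is likely longer.
SENSITIVE_DIRS = {"auth", "crypto"}


def is_security_sensitive(path: str) -> bool:
    """True if any directory component of the path is a sensitive directory."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name
    return any(part in SENSITIVE_DIRS for part in directories)


def select_tier(lines_changed: int, changed_paths: list[str]) -> str:
    """Pick a review tier from diff size, escalating for sensitive paths."""
    if any(is_security_sensitive(p) for p in changed_paths):
        return "full"  # always full review, regardless of diff size
    if lines_changed <= 10:
        return "trivial"
    if lines_changed <= 100:
        return "lite"
    return "full"
```

Checking the sensitive-path rule first is the important design choice: a three-line change under `auth/` would otherwise fall into the trivial tier.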
The system processes roughly 120 billion tokens per month, and an 85.7% prompt cache hit rate keeps costs manageable, saving an estimated five-figure sum each month. The trick: instead of duplicating the full MR context across all seven concurrent agents, they write it to disk once and have each agent read the shared file, eliminating a 7x token multiplication.
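The write-once, read-many idea can be illustrated with a small sketch that compares inlining the context into every agent prompt against pointing each agent at one shared file. The file name `mr_context.md`, the agent identifiers, and the prompt wording are hypothetical, and character counts stand in for tokens.

```python
import tempfile
from pathlib import Path

# Agent names follow the list in the post; the identifiers are made up.
AGENTS = [
    "security", "code_quality", "performance", "documentation",
    "release_management", "compliance", "agents_md",
]


def inline_prompts(context: str) -> dict[str, str]:
    """Naive approach: every agent prompt carries a full copy of the context."""
    return {a: f"You are the {a} reviewer.\n\n{context}" for a in AGENTS}


def shared_file_prompts(context: str, workdir: Path) -> dict[str, str]:
    """Write the context to disk once; each prompt only references the path."""
    context_file = workdir / "mr_context.md"  # hypothetical file name
    context_file.write_text(context, encoding="utf-8")
    return {
        a: f"You are the {a} reviewer. Read the merge request context from {context_file}."
        for a in AGENTS
    }


def total_chars(prompts: dict[str, str]) -> int:
    return sum(len(p) for p in prompts.values())


if __name__ == "__main__":
    context = "diff line\n" * 1000
    with tempfile.TemporaryDirectory() as d:
        shared = shared_file_prompts(context, Path(d))
    inline = inline_prompts(context)
    print("inline prompt size:", total_chars(inline))
    print("shared prompt size:", total_chars(shared))
```

With inlining, total prompt size grows as seven times the context; with the shared file, it stays roughly constant no matter how large the merge request is.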
What It Still Cannot Do
Cloudflare is refreshingly direct about the limitations. The system struggles with architectural awareness: it sees the diff but not the design intent behind it. It cannot verify that all downstream consumers of an API have been updated when a contract changes. It catches an obviously missing lock but not a subtle deadlock. And a 500-file refactor run through seven agents costs real money.
The break-glass override — where a comment of “break glass” forces approval regardless of AI findings — was used only 288 times across 48,095 merge requests (0.6%). Engineers almost never need to override it.
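As a rough illustration of how such an override might gate a merge, and to check the quoted rate, consider the sketch below. The function name, return values, and case-insensitive matching are assumptions; only the "break glass" phrase and the two counts come from the post.

```python
def review_gate(has_blocking_findings: bool, comments: list[str]) -> str:
    """Decide the merge status, letting a 'break glass' comment force approval."""
    if any("break glass" in c.lower() for c in comments):
        return "approved (override)"
    return "blocked" if has_blocking_findings else "approved"


# Override rate from the figures in the post: 288 uses across 48,095 MRs.
override_rate = 288 / 48_095
print(f"{override_rate:.1%}")
```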
What This Means for Engineering Teams
The architecture here (multi-agent, specialized, coordinator-synthesized) is the template that other engineering teams will copy. The implementation details (OpenCode, GitLab, Cloudflare Workers KV for the control plane) are particular to Cloudflare, but the pattern is transferable. Single-model generic code review produces noise. Seven specialized agents with a coordinator produce a review that engineering leads at Cloudflare actually rely on.
The deeper issue Cloudflare’s post surfaces: the code review bottleneck is the unsexy problem that actually determines whether AI-assisted development delivers on its productivity promise. Generating code faster is worthless if it piles up waiting for review. At a median of $0.98 per review with a median turnaround under four minutes, Cloudflare has a credible answer to that problem, built on top of OpenCode, the open-source agent that any team can start with today.













