Booking.com deployed AI coding tools to 3,500 engineers and faced the question every tech leader is asking: “Is this actually working?” Most companies rely on developer surveys in which engineers claim to “feel 35% faster,” but feelings don’t justify million-dollar budgets. Using the newly launched DX Core 4 framework, Booking.com got hard numbers: a 16% higher PR merge rate for AI users and 150,000 hours saved in year one. It is the first enterprise case study to definitively prove AI productivity gains, with a methodology anyone can replicate.
The Four Dimensions That Settled the Debate
DX Core 4 unifies three existing productivity frameworks—DORA, SPACE, and DevEx—into four balanced dimensions. Launched in January 2025 with backing from DORA creator Nicole Forsgren, it solves the “what should we measure?” problem that left leaders answering “it depends.”
The framework measures productivity across Speed (diffs per engineer, i.e., how fast code ships), Effectiveness (the Developer Experience Index, or DXI, a 14-item survey measuring workflow friction), Quality (change failure rate, for production stability), and Impact (time spent on new capabilities, connecting engineering work to business value).
The innovation lies in “oppositional metrics” that counterbalance each other. Speed metrics are balanced by Effectiveness scores, preventing teams from optimizing for velocity at the expense of developer satisfaction. A one-point DXI increase translates to 13 minutes saved per developer per week, and top-quartile DXI teams demonstrate 4-5 times greater speed and quality than bottom-quartile teams.
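To see how that per-point figure compounds at enterprise scale, here is a back-of-the-envelope sketch in Python. The 13-minute figure is DX’s published number; the organization size, the DXI delta, and the 48 working weeks per year are illustrative assumptions, not case-study data.

```python
# Back-of-the-envelope: hours recovered per year from a DXI improvement.
# The 13 minutes/developer/week per DXI point is DX's published figure;
# org size, DXI delta, and 48 working weeks are illustrative assumptions.

MINUTES_PER_DXI_POINT_PER_WEEK = 13

def annual_hours_saved(developers: int, dxi_delta: float, weeks: int = 48) -> float:
    """Estimated hours recovered per year for a given DXI improvement."""
    minutes = developers * dxi_delta * MINUTES_PER_DXI_POINT_PER_WEEK * weeks
    return minutes / 60

# A hypothetical 3,500-engineer organization gaining 4 DXI points:
print(f"{annual_hours_saved(3500, 4):,.0f} hours/year")  # ~145,600 hours/year
```

For context, a hypothetical four-point DXI gain across 3,500 engineers lands in the same ballpark as Booking.com’s reported 150,000 hours.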
Already deployed at hundreds of organizations since its January launch, Core 4 provides prescriptive metrics rather than leaving teams to run framework customization exercises.
Booking.com’s Proof: 16% Boost, 150K Hours Saved
Booking.com’s 3,500+ engineers initially met their AI tool rollout with skepticism. Usage lagged expectations, and the company lacked the data infrastructure to measure effectiveness. “We needed to understand how AI was affecting engineering velocity, satisfaction, and code quality,” explained Leo Kraan, Director of Engineering at Booking.com, in the official case study.
The methodology was straightforward: Booking.com expanded their DX dataset to include AI tool metadata and tracked daily usage frequency—not just whether developers tried it once, but whether they used it consistently. Developers using AI tools more than 12 days monthly showed significantly improved effectiveness compared to occasional users.
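A minimal sketch of that usage-based cohorting, assuming a simple event log of (developer, day) pairs; the schema and function names are hypothetical, though the more-than-12-days-per-month cutoff mirrors the case study:

```python
from collections import defaultdict
from datetime import date

def usage_cohorts(events: list[tuple[str, date]], threshold: int = 12) -> dict[str, str]:
    """Bucket developers by distinct days of AI-tool use in a given month.

    `events` is a hypothetical log of (developer_id, day) pairs; the
    >12-days-per-month cutoff mirrors the one in the case study.
    """
    days_used: dict[str, set[date]] = defaultdict(set)
    for dev, day in events:
        days_used[dev].add(day)
    return {
        dev: "consistent" if len(days) > threshold else "occasional"
        for dev, days in days_used.items()
    }

# Example: one developer with 15 usage days in March, one with 3.
log = [("alice", date(2025, 3, d)) for d in range(1, 16)]
log += [("bob", date(2025, 3, d)) for d in (2, 9, 16)]
print(usage_cohorts(log))  # {'alice': 'consistent', 'bob': 'occasional'}
```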
The results validated the investment. Daily active AI users achieved a 16% higher PR merge rate than non-users. Over the first year, this translated to 150,000 developer hours saved, with a 65% adoption rate across the engineering organization. Later optimizations pushed the merge-rate improvement to 31% in total.
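A quick sanity check on what those headline numbers imply per developer; the figures come from the case study, while the 48-working-weeks divisor is our assumption:

```python
# All figures from the case study; 48 working weeks/year is our assumption.
engineers, adoption, hours_saved = 3500, 0.65, 150_000
adopters = engineers * adoption                  # ~2,275 active AI users
weekly = hours_saved / adopters / 48             # hours per adopter per week
print(f"~{weekly:.1f} hours/week per adopter")   # ~1.4 hours/week
```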
“Without DX’s data, we wouldn’t be able to confidently talk about the ROI,” Kraan noted. Zane Wright, Senior Product Manager, added that “DX has helped us drive decisions like which vendors we select.”
The framework didn’t just measure productivity—it enabled business decisions. Booking.com now uses Core 4 metrics to evaluate competing AI vendors, structure enablement programs, and justify continued investment with concrete evidence.
Why DORA, SPACE, and DevEx Weren’t Enough
The productivity measurement wars created a gap. DORA metrics excel at CI/CD pipeline optimization but ignore developer experience. The SPACE framework offers comprehensive coverage but requires heavy customization. DevEx addresses developer satisfaction but lacks output metrics.
When executives asked DX CEO Abi Noda “what exactly should we measure?”, he kept answering “it depends.” That ambiguity doesn’t work when CFOs are evaluating million-dollar AI tool contracts.
DX Core 4 synthesizes the best parts of each predecessor: DORA’s prescriptive metrics, SPACE’s multidimensional thinking, and DevEx’s developer-experience focus. The result deploys “in weeks, not months,” leveraging readily available system metrics from Git and CI/CD pipelines combined with the standardized 14-item DXI survey.
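As an illustration of how readily those system metrics fall out of existing tooling, here is a minimal sketch that computes a PR merge rate from the GitHub REST API. It assumes GitHub-hosted repositories and omits pagination and error handling for brevity; Core 4 itself doesn’t prescribe any particular endpoint.

```python
import requests

def pr_merge_rate(owner: str, repo: str, token: str | None = None) -> float:
    """Share of recently closed PRs that were merged, via the GitHub REST API.

    Looks only at the latest 100 closed PRs; pagination and error handling
    are omitted for brevity.
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        params={"state": "closed", "per_page": 100},
        headers=headers,
    )
    resp.raise_for_status()
    pulls = resp.json()
    if not pulls:
        return 0.0
    merged = sum(1 for pr in pulls if pr["merged_at"] is not None)
    return merged / len(pulls)

print(f"{pr_merge_rate('octocat', 'Hello-World'):.0%}")
```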
What 16% Means as the New Baseline
Booking.com’s results establish 16% as the reference point for AI tool ROI. Companies can now compare their results against this benchmark instead of relying on anecdotal “35% faster” claims.
The framework is already shifting vendor dynamics. Booking.com used Core 4 to decide between competing AI assistant vendors, creating measurable performance criteria. That competitive pressure forces tool providers to prove value rather than rely on marketing promises.
The next frontier is code quality. Booking.com is investigating correlations between AI-generated code and vulnerability rates. The question is evolving from “does it work?” to “how safe is it?” That shift matters: productivity gains mean nothing if they introduce vulnerabilities or technical debt.
How to Replicate Booking.com’s Approach
Start with the 14-item DXI survey to establish a baseline (2-4 weeks). Add system metrics by pulling data from existing Git repositories and CI/CD pipelines. Track AI coding assistant usage metadata to distinguish daily active users from occasional ones. Finally, compare cohorts (AI users vs. non-users) to measure impact, as in the sketch below.
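A minimal version of that cohort comparison, using hypothetical per-developer monthly merged-PR counts; the Mann-Whitney U test is our choice of significance check, not something the framework prescribes:

```python
from statistics import mean
from scipy.stats import mannwhitneyu

# Hypothetical per-developer monthly merged-PR counts for each cohort;
# engineered so the lift lands near 16%, purely for illustration.
ai_users  = [14, 11, 16, 12, 15, 13, 17, 10]
non_users = [12, 11, 12, 11, 10, 13, 11, 13]

stat, p = mannwhitneyu(ai_users, non_users, alternative="greater")
lift = mean(ai_users) / mean(non_users) - 1
print(f"merge-output lift: {lift:.0%} (Mann-Whitney U p = {p:.3f})")
```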
Critical safeguards matter. Never use these metrics for individual performance reviews—they’re designed for team insights only. Don’t set targets or rewards tied to throughput metrics. Transparent communication is essential: these metrics exist to identify friction points and improve developer experience.
Expect meaningful trends within 3-6 months, though initial insights emerge after 1-2 months of data collection.
The Framework That Finally Proved AI Works
DX Core 4 unifies DORA, SPACE, and DevEx into four balanced dimensions: speed, effectiveness, quality, and impact. Booking.com deployed it across 3,500 engineers and proved a 16% productivity boost with 150,000 hours saved in year one—the first enterprise case study to definitively measure AI tool ROI at scale.
The methodology is replicable. Start with DXI surveys, add system metrics, track daily usage, and compare cohorts. The framework prevents gaming through oppositional metrics that balance output with experience.
Sixteen percent is now the baseline. Every AI tool vendor, engineering leader, and developer can reference Booking.com’s results when evaluating whether AI assistants justify their cost. The answer, backed by data instead of feelings, is yes—when measured correctly.