
A September 2025 study by METR found developers using AI coding tools were 19% slower at completing tasks—even though they predicted a 40% speedup before starting and still believed they were 20% faster after finishing. Meanwhile, GitHub claims 55% faster task completion with Copilot, Google reports 25% of its code is AI-generated, and Bain & Company describes real-world productivity savings as “unremarkable.” With 80% of enterprises adopting AI coding tools by 2026 and spending $114k-234k annually per 500 developers, the measurement problem is killing ROI.
Everyone’s measuring different things: vendors track “time to suggestion,” developers experience cognitive offload, researchers measure end-to-end throughput, and businesses care about features shipped. The productivity paradox isn’t about whether AI helps—it’s about whether anyone’s measuring the right things.
The Measurement Disconnect
Vendors optimize for metrics that make AI look great. GitHub's 55%-faster claim measured isolated task completion: no code review, no integration testing, no production deployment. It's like judging a car's speed by how fast the engine revs in neutral.
Researchers measure differently. The METR study analyzed 16 professional developers across 246 tasks in massive codebases, tracking 140+ hours of screen recordings to identify friction: time spent formulating prompts, reviewing AI suggestions, integrating outputs, and context switching. Result: 19% slower despite developers feeling faster.
LinearB’s 2026 benchmarks analyzed 8.1 million pull requests and found AI PRs have a 32.7% acceptance rate versus 84.4% for manual PRs. AI code waits 4.6x longer before review. The bottleneck shifted from writing to reviewing—and the “productivity” gain disappears at merge time. If you measure “suggestions accepted,” ROI looks fantastic. If you measure “working code shipped to production,” ROI vanishes.
The Hidden Costs
GitHub Copilot costs $19-39/user/month, totaling $114k-234k annually for a 500-developer team. Microsoft is hiking Microsoft 365 Copilot to $39/user/month in mid-2026, forcing a “put up or shut up” moment on AI ROI.
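The direct-cost arithmetic behind those figures is just seat count times list price. A minimal sketch, assuming the $19 and $39 per-user monthly tiers cited above and a 500-developer team:

```python
# Back-of-the-envelope annual licensing cost for a 500-developer team.
# Assumes the per-seat prices cited above ($19 and $39 per user per month).
DEVELOPERS = 500
MONTHLY_PRICE_LOW = 19   # USD per user per month (lower tier)
MONTHLY_PRICE_HIGH = 39  # USD per user per month (higher tier)

annual_low = DEVELOPERS * MONTHLY_PRICE_LOW * 12    # 114,000
annual_high = DEVELOPERS * MONTHLY_PRICE_HIGH * 12  # 234,000

print(f"Annual licensing: ${annual_low:,} - ${annual_high:,}")
# Annual licensing: $114,000 - $234,000
```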
Direct costs are just the start. LinearB’s data shows 67.3% of AI PRs get rejected versus 15.6% of manual PRs. If AI generates code 55% faster but 67% gets rejected, the net productivity gain is negative. A Hacker News developer summarized it: “There is more work to review all around and much of it is of poor quality. LLMs start fixing code that isn’t used and then confidently report that they solved the problem.”
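A rough back-of-the-envelope model, using only the figures cited above and assuming rejected PRs are simply discarded (a simplification, since some get reworked and eventually merged), shows how the math turns negative at merge time:

```python
# Rough model: merged output per unit of developer time.
# Assumptions: "55% faster" means 1.55x more PRs produced per unit time,
# and rejected PRs are discarded outright (no rework) -- a simplification.
manual_rate = 1.0          # baseline PRs produced per unit time
ai_rate = 1.55             # 55% faster generation

manual_acceptance = 0.844  # 84.4% of manual PRs merged
ai_acceptance = 0.327      # 32.7% of AI-assisted PRs merged

manual_merged = manual_rate * manual_acceptance  # ~0.84 merged PRs
ai_merged = ai_rate * ai_acceptance              # ~0.51 merged PRs

print(f"Relative merged throughput: {ai_merged / manual_merged:.0%}")
# ~60% of manual throughput: a net loss despite "55% faster" generation
```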
When Bain & Company describes real-world savings as “unremarkable” despite vendor claims of 20-55% gains, it’s because hidden costs offset headline benefits. Companies spend millions without comprehensive metrics to evaluate whether they’re getting value or just generating more code to review.
Why Developers Feel Faster Despite Being Slower
The perception gap is measurable. METR's developers predicted a 40% speedup, experienced a 19% slowdown, and afterward still believed they had been 20% faster. The reason: cognitive offloading creates the illusion of speed. AI handles the boring parts (boilerplate, syntax recall), making the work more enjoyable. Developers confuse satisfaction with throughput.
AI works best in narrow contexts. Developers report “years’ worth of work in 2 months” on greenfield R&D projects where AI generates CRUD operations and configuration files. AI falls apart on legacy codebases with complex dependencies and security-critical paths. One HN developer noted: “LLMs are useful if you are knowledgeable and capable in the domain.” AI amplifies existing skills: weak developers produce more weak code faster, strong developers offload grunt work.
The Solution: Fix Measurement First
McKinsey’s 2025 State of AI report found nearly two-thirds of executives haven’t scaled AI programs. The reason isn’t lack of tools; it’s lack of measurement frameworks. The fix is to combine DORA metrics (deployment frequency, lead time for changes, change failure rate, time to restore service) with the SPACE framework (satisfaction, performance, activity, communication, efficiency) and AI-specific metrics: acceptance rate for AI PRs versus manual ones, review wait time, and time from suggestion to merged PR.
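As a sketch of the AI-specific layer, the snippet below computes those three metrics from a hypothetical list of PR records. The `PullRequest` fields and the way PRs are flagged as AI-assisted are illustrative assumptions, not any particular vendor's API, and “time from suggestion to merge” is approximated as PR open-to-merge since suggestion timestamps rarely leave the IDE:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List, Optional

@dataclass
class PullRequest:
    ai_assisted: bool                    # e.g. flagged via a PR label at creation
    opened_at: datetime
    first_review_at: Optional[datetime]  # None if never reviewed
    merged_at: Optional[datetime]        # None if rejected or abandoned

def acceptance_rate(prs: List[PullRequest]) -> float:
    """Share of PRs that were eventually merged."""
    return sum(pr.merged_at is not None for pr in prs) / len(prs)

def avg_review_wait_hours(prs: List[PullRequest]) -> float:
    """Mean hours from opening a PR to its first review."""
    waits = [(pr.first_review_at - pr.opened_at).total_seconds() / 3600
             for pr in prs if pr.first_review_at is not None]
    return mean(waits)

def avg_open_to_merge_hours(prs: List[PullRequest]) -> float:
    """Mean hours from opening a PR to merging it (merged PRs only)."""
    times = [(pr.merged_at - pr.opened_at).total_seconds() / 3600
             for pr in prs if pr.merged_at is not None]
    return mean(times)

def report(prs: List[PullRequest]) -> None:
    """Print the three metrics side by side for AI-assisted vs. manual PRs."""
    for label, group in (("AI-assisted", [p for p in prs if p.ai_assisted]),
                         ("Manual", [p for p in prs if not p.ai_assisted])):
        print(f"{label}: acceptance {acceptance_rate(group):.1%}, "
              f"review wait {avg_review_wait_hours(group):.1f}h, "
              f"open-to-merge {avg_open_to_merge_hours(group):.1f}h")
```

In practice the records would come from the Git host's PR history and be segmented by team and repository; the point is simply to report AI-assisted and manual work side by side rather than in aggregate.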
Real-world success comes from measurement rigor. Booking.com deployed AI to 3,500+ engineers and achieved a 16% throughput increase by running pilot programs with A/B testing. Intercom realized 41% time savings by tracking not just speed but also code quality and developer satisfaction. Both measured all three dimensions experts recommend: productivity, impact, and satisfaction.
Best practice: Run pilot programs, A/B test teams with and without AI, and track project-level outcomes like features shipped and incidents resolved. Connect AI usage to business outcomes—revenue enabled, costs avoided. If metrics don’t align with business value, the tool isn’t working.
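A minimal sketch of that A/B comparison, assuming weekly counts of features shipped per team are available (the numbers below are placeholders) and using a plain Welch's t-statistic rather than any particular analytics product:

```python
# Minimal A/B comparison of a project-level outcome (weekly features shipped)
# between pilot teams using AI tooling and control teams without it.
from statistics import mean, stdev
from math import sqrt

with_ai    = [7, 9, 6, 8, 10, 7, 9, 8]  # placeholder weekly counts, pilot teams
without_ai = [6, 7, 7, 8, 6, 7, 8, 7]   # placeholder weekly counts, control teams

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / sqrt(va + vb)

print(f"Mean with AI: {mean(with_ai):.1f}, without: {mean(without_ai):.1f}")
print(f"Welch t-statistic: {welch_t(with_ai, without_ai):.2f}")
# Interpret alongside review burden, defect rates, and incidents resolved,
# not in isolation; a higher mean here says nothing about merged quality.
```

A real pilot would also control for team size, tenure, and project mix, but even this crude comparison forces the conversation onto shipped outcomes rather than suggestion counts.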
Key Takeaways
The hype cycle is over. 2026 marks the measurement maturity phase where enterprises demand proof of ROI. Developers feel 20% faster but measure 19% slower. Vendors claim 55% gains on isolated tasks while researchers measure negative throughput on real work. Direct costs run $114k-234k/year, hidden costs accumulate in review burden, and acceptance rates tell the real story—32.7% for AI versus 84.4% for manual.
The paradox won’t disappear. Context matters: AI helps on boilerplate and greenfield work, hurts on legacy codebases. Fix measurement first—align on DORA plus SPACE plus AI-specific metrics. Run pilot programs, A/B test, track business outcomes. If you measure “suggestions accepted,” you’ll buy more tools. If you measure “working code shipped,” you’ll optimize what matters.