AI can generate thousands of lines of code in seconds, making “lines of code” worthless as a productivity metric. And yet, according to three major 2025 developer surveys covering 73,534 developer respondents plus engineering data from 3,000+ organizations, engineering teams are still measuring the wrong things. Stack Overflow, JetBrains, and LinearB all released surveys this year documenting a seismic shift: the metrics that dominated for a decade are dying, and what replaces them determines whether your team thrives or burns out. The problem isn’t subtle: 66% of developers believe current metrics don’t reflect their true contributions. That’s two out of three developers saying they’re being measured incorrectly.
Why Traditional Metrics Failed
Activity-based metrics were always flawed, but AI code generation made them obsolete overnight. The numbers tell the story: 41% of all code globally is now AI-generated, with 256 billion lines written in 2024 alone. Google reports 25% of its code is AI-assisted, and some startups are building products with 95% AI-generated code. When 92% of US developers use AI coding tools that can produce thousands of lines in seconds, counting lines of code becomes as meaningful as counting keystrokes.
The quality crisis compounds the volume problem. Code refactoring dropped from 25% in 2021 to less than 10% in 2024, while copy/pasted code rose from 8.3% to 12.3%. GitClear’s analysis of 153 million lines found that code blocks with five or more duplicates increased eightfold during 2024. As MIT Professor Maria Gonzalez puts it: “We’re optimizing for the wrong metric. Code volume tells you nothing about value delivery. An AI can generate a million lines of boilerplate faster than a human writes a hundred lines of critical business logic.”
The paradox becomes clear: we might hit 90% AI-generated code by volume while the remaining 10% of human code handles 60% of the complexity. Measuring commits, pull requests, or velocity points doesn’t solve this. These metrics incentivize wrong behavior—meaningless commits, artificially split PRs, and gaming the system—because activity doesn’t equal outcomes, speed doesn’t guarantee quality, and volume doesn’t create value.
The New Frameworks: DORA, SPACE, DX Core 4
Three frameworks emerged to replace activity metrics, each offering a different approach to measuring what actually matters.
DORA metrics, created by Google’s DevOps Research and Assessment team, focus on software delivery performance through four measures: deployment frequency (how often you ship), lead time for changes (commit to production time), change failure rate (percentage of deployments causing issues), and mean time to recover (how fast you fix failures). The benchmark for quality is a 5-10% change failure rate, and elite DORA performers are twice as likely to meet organizational targets. The strength is objectivity—quantitative metrics with industry-standard benchmarks. The limitation is what it doesn’t measure: developer well-being, satisfaction, or the “why” behind performance changes.
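To make the four measures concrete, here is a minimal Python sketch that computes them from exported deployment records, assuming you can pull commit time, deploy time, incident flags, and recovery time out of your CI/CD and incident tooling; the Deployment record and its field names are illustrative, not tied to any particular product.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional

@dataclass
class Deployment:
    commit_at: datetime                        # when the change was committed
    deployed_at: datetime                      # when it reached production
    caused_failure: bool                       # did this deploy trigger an incident?
    recovered_at: Optional[datetime] = None    # when service was restored, if it failed

def dora_metrics(deployments: list[Deployment], window_days: int = 30) -> dict:
    """Compute the four DORA measures over a window of deployment records."""
    if not deployments:
        raise ValueError("need at least one deployment in the window")
    failed = [d for d in deployments if d.caused_failure]
    recovery_hours = [
        (d.recovered_at - d.deployed_at).total_seconds() / 3600
        for d in failed if d.recovered_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "lead_time_hours": mean(
            (d.deployed_at - d.commit_at).total_seconds() / 3600 for d in deployments
        ),
        "change_failure_rate": len(failed) / len(deployments),
        "mean_time_to_recover_hours": mean(recovery_hours) if recovery_hours else None,
    }
```

With numbers like these in hand, a change failure rate above the 5-10% benchmark becomes an immediate cue to investigate quality rather than push deployment frequency higher.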
The SPACE framework, developed by Microsoft and GitHub, takes a holistic approach across five dimensions: Satisfaction (developer happiness and health), Performance (both outputs like features shipped and outcomes like user satisfaction), Activity (coding, testing, collaboration), Communication (team interaction effectiveness), and Efficiency (ability to work in flow state with minimal interruptions). The core principle: “Productivity cannot be reduced to a single dimension or metric.” Organizations should select metrics from at least three categories, mixing quantitative measures like Git data with qualitative measures like developer surveys. SPACE prevents over-optimization of any single dimension but requires both system data and survey data, making it more complex to implement than DORA.
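One way to picture the “at least three categories” guidance is a simple scorecard that tags each metric with its SPACE dimension and its data source. The sketch below is hypothetical; the metric names, scales, and values are placeholders, not measures prescribed by SPACE.

```python
# Hypothetical team scorecard: each entry names its SPACE dimension, its data
# source (system data vs. survey), and a current reading. All values are examples.
scorecard = [
    {"dimension": "satisfaction",  "source": "survey", "metric": "eNPS",                         "value": 32},
    {"dimension": "performance",   "source": "system", "metric": "feature adoption rate",        "value": 0.46},
    {"dimension": "activity",      "source": "system", "metric": "PRs merged per week",          "value": 38},
    {"dimension": "communication", "source": "survey", "metric": "docs discoverability (1-5)",   "value": 3.7},
    {"dimension": "efficiency",    "source": "survey", "metric": "uninterrupted focus hrs/week", "value": 11},
]

def follows_space_guidance(entries: list[dict]) -> bool:
    """SPACE guidance: cover at least three dimensions and mix system and survey data."""
    dimensions = {e["dimension"] for e in entries}
    sources = {e["source"] for e in entries}
    return len(dimensions) >= 3 and {"system", "survey"} <= sources

assert follows_space_guidance(scorecard)
```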
DX Core 4 consolidates DORA, SPACE, and DevEx into a unified approach measuring four balanced dimensions: Speed (code to production velocity), Effectiveness (how efficiently developers work), Quality (reliability and stability), and Impact (business value from engineering work). The distinguishing feature is integration—quantitative metrics plus qualitative insights plus business outcomes in one cohesive system. The results are documented: Booking.com quantified a 16% productivity lift from AI adoption, Adyen achieved measurable improvements across 50% of teams in just three months, and over 300 organizations using DX Core 4 reported 3-12% efficiency gains.
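As a rough sketch of what that integration could look like in practice, the snippet below rolls one representative reading per Core 4 dimension into a single report so that no dimension is read in isolation; the metric choices, data sources, and values are assumptions for illustration, not the framework’s official key metrics.

```python
# Hypothetical DX Core 4-style rollup for one team: one representative reading per
# dimension, mixing system data, survey data, and business outcomes. Metric choices,
# scales, and values are illustrative assumptions.
core4_report = {
    "speed":         {"source": "CI/CD",     "metric": "lead time (hours)",                  "value": 26.0},
    "effectiveness": {"source": "survey",    "metric": "developer experience index (0-100)", "value": 71},
    "quality":       {"source": "incidents", "metric": "change failure rate",                "value": 0.07},
    "impact":        {"source": "planning",  "metric": "share of time on new capabilities",  "value": 0.58},
}

def summarize(report: dict) -> str:
    """Render all four dimensions side by side so none is read in isolation."""
    return " | ".join(f"{dim}: {r['metric']} = {r['value']}" for dim, r in report.items())

print(summarize(core4_report))
```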
All three surveys converge on the same insight: top-performing companies use both quantitative and qualitative measures. You need system data and developer feedback, not one or the other.
The Critical Shift: Time to Value Over Deployment Frequency
The biggest revelation in 2025 isn’t which framework to use—it’s what to measure within those frameworks. Engineering teams realized that shipping fast isn’t enough. Features must drive impact quickly. You can deploy ten times per day and still fail if customers don’t adopt what you’re shipping. Deployment doesn’t equal value realization.
“Time to value” measures the duration from the moment an idea is defined until customers are actually using the shipped feature successfully. What top teams track: feature usage rates post-deployment, revenue impact within X days of launch, user satisfaction scores after release, time to meaningful adoption (not just deployment), and business KPIs linked to specific releases. The principle is simple: raw speed doesn’t equal success unless it’s tied to delivering customer value. A team can hit all its deployment goals and still fall short if customers struggle to adopt what’s being shipped.
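As a rough illustration of how time to value differs from lead time, the sketch below measures hours from idea definition not just to deployment but to the point where a feature crosses an adoption threshold; the 20% threshold and the function’s inputs are assumptions for illustration.

```python
from datetime import datetime

def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600

def time_to_value(idea_defined: datetime,
                  deployed: datetime,
                  daily_feature_users: dict[datetime, int],   # users of the feature per day
                  eligible_users: int,
                  adoption_threshold: float = 0.20) -> dict:  # assumed threshold for "meaningful adoption"
    """Lead time stops at deployment; time to value stops at meaningful adoption."""
    adopted_at = next(
        (day for day, users in sorted(daily_feature_users.items())
         if users / eligible_users >= adoption_threshold),
        None,
    )
    return {
        "lead_time_hours": hours_between(idea_defined, deployed),
        "time_to_value_hours": hours_between(idea_defined, adopted_at) if adopted_at else None,
        "reached_meaningful_adoption": adopted_at is not None,
    }
```

A feature that deploys quickly but never crosses the threshold shows up as a fast lead time with no time to value, which is exactly the gap this metric is meant to expose.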
This shift from deployment frequency to time to value represents the fundamental change: outcomes matter more than outputs. “Deployed five features” is an output. “Those five features drove a 10% increase in user engagement” is an outcome. The metric that matters is the latter.
What the Surveys Reveal
Stack Overflow’s survey of 49,000+ developers across 177 countries found that 80% now use AI tools, though positive sentiment dropped to 60% from 70%+ in prior years—the first decline. OpenAI’s GPT models dominate at 81.4% usage, while Claude Sonnet ranks second at 42.8%. Technology adoption shifted dramatically: Python jumped 7 percentage points year-over-year (driven by AI and data science), and Docker saw the largest increase of any technology surveyed with a 17-point jump to 71.1% adoption.
JetBrains surveyed 24,534 developers across 194 countries and found that 85% regularly use AI tools for coding, with 62% relying on at least one AI assistant. Time savings are substantial: 90% save at least one hour per week, and 20% save eight or more hours—equivalent to a full workday. But the critical finding is the measurement crisis: 66% believe current metrics don’t reflect true contributions. Two-thirds of developers are saying they’re being measured incorrectly, and both technical factors (51%) and non-technical factors (62%) matter for productivity.
LinearB analyzed over 6 million pull requests from 3,000+ organizations across 32 countries, providing real engineering data from Git and CI/CD systems rather than self-reported surveys. The insight: “Benchmarks aren’t law—mobile teams work very differently than hardware teams or API teams.” Context matters. Your situation (startup versus enterprise, B2B versus B2C) determines which benchmarks apply.
What Developers Should Do
For individual contributors: Understand what you’re being measured on. Ask your manager what metrics they track and why. Push back on pure activity metrics like commits, lines of code, or PR count because they incentivize wrong behavior and are easily gamed. Advocate for outcome metrics: “Did my work create business value? Did users adopt my feature? Did this reduce support tickets?” Track your own satisfaction because burnout kills long-term productivity, and if you’re hitting metrics but miserable, something’s wrong.
For engineering managers: Combine DORA and SPACE or use DX Core 4. DORA alone risks prioritizing speed over quality. SPACE alone is too broad and hard to benchmark. The best approach balances delivery metrics with developer experience. Use industry benchmarks as context, not law. Collect both quantitative data (Git, CI/CD, deployment metrics) and qualitative data (surveys, one-on-ones, retrospectives). Don’t optimize one metric at the expense of others—balance speed, quality, and developer well-being. Measure outcomes, not just outputs.
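As a sketch of the “don’t optimize one metric at the expense of others” rule, the check below compares quarter-over-quarter readings and flags speed gains that arrive alongside a rising change failure rate or falling survey satisfaction; the thresholds, field names, and sample numbers are assumptions.

```python
# Hypothetical quarter-over-quarter readings mixing system data with survey data.
previous = {"deploys_per_week": 12, "change_failure_rate": 0.06, "satisfaction": 7.8}
current  = {"deploys_per_week": 19, "change_failure_rate": 0.11, "satisfaction": 7.1}

def balance_warnings(prev: dict, curr: dict) -> list[str]:
    """Flag speed gains that come at the cost of quality or developer well-being."""
    warnings = []
    speeding_up = curr["deploys_per_week"] > prev["deploys_per_week"]
    if speeding_up and curr["change_failure_rate"] > max(prev["change_failure_rate"], 0.10):
        warnings.append("deployment frequency up, but change failure rate exceeds the 5-10% benchmark")
    if speeding_up and curr["satisfaction"] < prev["satisfaction"]:
        warnings.append("deployment frequency up, but developer satisfaction is trending down")
    return warnings

for warning in balance_warnings(previous, current):
    print("WARNING:", warning)
```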
The universal truth: developer productivity is not about how much code is written. It’s about how much value a team can deliver sustainably. Metrics should inform decisions, not dictate them. Measure value, not volume.

