SWE-bench Pro: How to Read the Coding Agent Leaderboard


SWE-bench Pro replaced SWE-bench Verified as the standard AI coding benchmark in 2026

The benchmark everyone cited to prove their AI coding agent was best has been quietly invalidated. SWE-bench Verified — the 500-task Python leaderboard that every vendor quote-mined for marketing — was abandoned by OpenAI on February 23, 2026, after its Frontier Evals team found that 59.4% of the hardest failed tests were themselves broken. Add training data contamination across every frontier model, and those sub-90% scores were partly memorization dressed up as capability. SWE-bench Pro is what replaced it. If you haven’t recalibrated how you read the numbers, you’re still making tool decisions on bad data.

What Went Wrong With SWE-bench Verified

SWE-bench Verified was a curated set of 500 real GitHub issues, all Python, human-validated. It made sense as a benchmark in 2024. By late 2025 it had two fatal problems.

The first was test quality. When OpenAI audited the hardest failures, 49 tests were too narrow — rejecting correct solutions for arbitrary reasons. Another 26 were too wide — accepting wrong solutions that happened to pass. More than half the benchmark’s hardest problems were measuring model alignment with broken assumptions, not actual coding ability.

The second was contamination. Any frontier model trained on GitHub data after June 2024 had likely seen the 500 Verified problems — including their solutions. The benchmark’s small Python-only scope made it trivially easy to overfit. Models weren’t solving issues from scratch; they were partially recalling answers they’d encountered during training. OpenAI’s recommendation when it walked away: shift to SWE-bench Pro.

What SWE-bench Pro Actually Is

SWE-bench Pro is Scale AI’s replacement benchmark, and it’s a substantially harder target. The dataset contains 1,865 tasks drawn from 41 actively maintained repositories spanning Python, Go, TypeScript, and JavaScript. It’s partitioned into three sets: a public set (731 tasks from 11 open-source repos), a commercial set (276 tasks from 18 proprietary repos), and a held-out set (858 tasks that no vendor has seen before evaluation).
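As a quick sanity check on those split sizes, here is a minimal Python sketch. The dictionary and variable names are illustrative only and are not part of any Scale AI tooling. It confirms that the three partitions add up to the stated total and shows what share of the benchmark each one represents.

```python
# Split sizes as quoted in this article; the names are illustrative only.
SPLITS = {"public": 731, "commercial": 276, "held_out": 858}

# Total number of tasks across all three partitions.
total = sum(SPLITS.values())

# Share of the full benchmark held by each partition, in percent.
shares = {name: round(100 * count / total, 1) for name, count in SPLITS.items()}

print(f"total tasks: {total}")
for name, share in shares.items():
    print(f"{name}: {share}% of tasks")
```

On these numbers the public set is under 40% of the benchmark, so a score quoted only on the public split covers a minority of the tasks.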

The contamination-resistance comes from the commercial and held-out sets. The 18 commercial repositories are licensed specifically to prevent their inclusion in training data — there’s a legal deterrent, not just a social norm. Every task requires at least 10 lines of changed code; over 100 tasks require 100+ line modifications across multiple files. These are long-horizon tasks that take a professional engineer hours to days, not the quick single-file patches that SWE-bench Verified favored.

The 20-Point Drop That Exposes the Old Numbers

Here’s the clearest evidence that the old leaderboard was inflated: models typically drop 19 to 26 percentage points moving from Verified to Pro, and the average drop across all tested models is 23 points. Some fall further. Claude Opus 4.5, for example, scored 80.9% on SWE-bench Verified. On SWE-bench Pro’s public leaderboard, using Scale’s standardized scaffold on tasks it is far less likely to have seen in training, the same model scores 45.9%, a 35-point drop.

That 23-point gap isn’t a model getting worse. It’s the contamination premium being stripped away. When we previously covered GLM-5.2 beating GPT-5.5 on SWE-bench, those scores were Verified numbers — a comparison that is now known to be unreliable.
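These drops are percentage points, not relative percentages, and the two are easy to confuse when comparing leaderboards. The sketch below uses two hypothetical helper names, `point_drop` and `relative_drop`, and applies them to the Claude Opus 4.5 figures quoted above.

```python
def point_drop(verified_pct: float, pro_pct: float) -> float:
    """Absolute drop in percentage points between two scores."""
    return round(verified_pct - pro_pct, 1)


def relative_drop(verified_pct: float, pro_pct: float) -> float:
    """Drop expressed as a percentage of the original Verified score."""
    return round(100 * (verified_pct - pro_pct) / verified_pct, 1)


# Claude Opus 4.5 figures as quoted in this article.
VERIFIED, PRO = 80.9, 45.9

opus_points = point_drop(VERIFIED, PRO)
opus_relative = relative_drop(VERIFIED, PRO)

print(f"drop: {opus_points} points ({opus_relative}% of the Verified score)")
```

For these two figures the absolute drop comes to 35 points, which is a little over 43% of the Verified score in relative terms.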

The Scaffolding Variable Nobody Mentions

The single most important thing most coverage of SWE-bench Pro misses: the scaffold matters more than the model.

The same LLM weights, run through different agent frameworks, produce scores ranging from 42% to 78% on coding benchmarks. Swapping between the six best frontier models moves the score by less than one percentage point. In February 2026, three different frameworks running identical model weights scored 17 tasks apart on 731 problems. The scaffold — the prompting strategy, tool selection, retrieval system, and iteration loop — is doing most of the work.
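To put the February 2026 comparison on the same scale as the other figures, a task-count gap can be converted into percentage points. This is a minimal sketch: `tasks_to_points` is a hypothetical helper, and the one-point model-swap figure is simply the upper bound quoted above.

```python
def tasks_to_points(task_gap: int, total_tasks: int) -> float:
    """Convert a difference in solved tasks into percentage points."""
    return round(100 * task_gap / total_tasks, 1)


# Figures quoted in this article: 17 tasks apart on the 731-task public set.
scaffold_gap = tasks_to_points(17, 731)

# Upper bound quoted for swapping between the six best frontier models.
MODEL_SWAP_MAX_POINTS = 1.0

print(f"scaffold gap: {scaffold_gap} points")
print(f"exceeds model-swap bound: {scaffold_gap > MODEL_SWAP_MAX_POINTS}")
```

A 17-task spread works out to roughly 2.3 points, more than double the spread the article attributes to changing the model itself.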

This matters because most published SWE-bench Pro scores don’t disclose the scaffold used. A vendor-reported score using the vendor’s own optimized harness is not comparable to Scale’s standardized SEAL leaderboard score. The June 2026 active leaders illustrate this: Claude Opus 4.8 reports 69.2% on its own scaffold; the Scale SEAL standardized score for the leading available model is 59.1%. That’s a 10-point gap, and much of it comes down to methodology rather than model quality. Tools like OpenCode, which leads on GitHub stars precisely because of its scaffolding approach, underscore how much the harness matters.

How to Actually Read the Leaderboard

Three questions to ask before trusting any published score:

Question	Why It Matters
Which split? (Public / Commercial / Held-out)	Held-out is hardest and most honest. Public is easiest to game and most commonly quoted.
Whose scaffold?	Vendor-reported = inflated. Scale SEAL standard = more comparable across models.
Is the model available?	Claude Mythos 5 (80.3%) and Fable 5 (80%) have been suspended since June 12, 2026. The active leader is Opus 4.8 at 69.2%.
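The three questions in the table can be written down as a small checklist. The sketch below is illustrative only: `PublishedScore`, `caveats`, and the label `"seal"` are hypothetical names and do not belong to any real leaderboard API. The split is left as `None` in the example because the article does not say which split the 69.2% figure was measured on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PublishedScore:
    model: str
    score_pct: float
    split: Optional[str]  # "public", "commercial", "held_out", or None if undisclosed
    scaffold: str         # "seal" for Scale's standardized harness, else a vendor label
    available: bool       # whether the model can currently be used


def caveats(score: PublishedScore) -> List[str]:
    """Return the reasons a published score should be read with caution."""
    notes = []
    if score.split is None:
        notes.append("split not disclosed")
    elif score.split == "public":
        notes.append("public split: easiest to game")
    if score.scaffold != "seal":
        notes.append("vendor scaffold: not comparable to SEAL scores")
    if not score.available:
        notes.append("model not currently available")
    return notes


# Vendor-reported figure quoted in this article; split left undisclosed.
opus = PublishedScore("Claude Opus 4.8", 69.2, None, "vendor", True)
opus_notes = caveats(opus)

for note in opus_notes:
    print(f"- {note}")
```

A score that comes back with an empty list (a disclosed commercial or held-out split, the standardized scaffold, an available model) is the kind that can reasonably be compared across vendors.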

For a fuller picture, pair SWE-bench Pro with Terminal-Bench — the Stanford/Laude Institute benchmark that covers shell scripting, CLI tooling, and infrastructure tasks that SWE-bench doesn’t touch. GPT-5.5 leads Terminal-Bench at 78.2%; Claude Opus 4.8 leads SWE-bench Pro at 69.2%. Knowing both tells you more than either alone.

Use SWE-bench Pro as a floor, not a ceiling. Its long-horizon commercial tasks are still easier than production engineering. But they’re a more honest signal than anything that came before them — and at this point, that’s the bar.

ByteBot
