Claude Opus 4.7: Developers Call It “Legendarily Bad” in 24h

Anthropic released Claude Opus 4.7 on April 16, claiming major improvements in coding capabilities and benchmark performance. Within 24 hours, developers on Reddit and X labeled it “legendarily bad,” citing arguing behavior, hallucinations, and worse performance on routine tasks compared to version 4.6. The backlash exposes a growing gap between AI benchmarks and real-world developer experience—and raises serious questions about whether aggressive safety training is making AI models worse, not better.

The “Legendarily Bad” Backlash

The developer revolt happened fast. A Reddit post titled “Opus 4.7 is not an upgrade but a serious regression” racked up 2,300 upvotes. On X, criticism posts hit 14,000 likes. Hacker News discussions exploded to 500+ comments across multiple threads. The pattern was consistent: developers who upgraded immediately regretted it.

The complaints are specific. The model argues with users instead of executing instructions cleanly. One developer captured it perfectly: “The model argues nonstop to the point of hallucination.” When you ask it to refactor a function or rename a variable—straightforward daily tasks—it pushes back, invents plausible-sounding justifications for its position, and fights corrections.

This isn’t a vague feeling. It’s measurable regression on routine work. Tasks that ran smoothly in Opus 4.6 now create friction. Developers report better performance on complex multi-file architecture work but worse results on the bread-and-butter coding that fills most of their day.

When Benchmarks Lie

Here’s where it gets interesting. According to benchmarks, Claude Opus 4.7 is significantly better than 4.6. SWE-bench Pro scores jumped from 53.4% to 64.3%. SWE-bench Verified climbed from 80.8% to 87.6%. CursorBench rose 12 points to 70%. By every standard measure, this model is an improvement.

Except developers say it’s worse. And they’re right.

The problem isn’t the developers. It’s the benchmarks. Research shows a 37% gap between lab benchmark scores and real-world deployment performance in enterprise AI systems. All frontier models show training data overlap with SWE-bench tests, meaning high scores might reflect memorization, not capability. An OpenAI audit found that 59.4% of hard tasks in these benchmarks have flawed tests.

More fundamentally, AI is almost never used the way it’s benchmarked. Developers don’t work in isolated test environments with predefined tasks. They work in messy codebases with shifting requirements and ambiguous problems. When a model scores higher on tests but performs worse on daily work, the tests are broken.

This is a pattern across the AI industry: optimize for metrics, ignore user experience. Anthropic isn’t alone here, but Opus 4.7 is a particularly clear example of the disconnect.

Safety Training Gone Wrong

Why does Opus 4.7 argue instead of help? Developer consensus points to safety overfit—aggressive safety training that made the model overly defensive rather than useful.

The theory goes like this: Anthropic’s RLHF (reinforcement learning from human feedback) process over-corrected. Instead of learning to execute instructions cleanly, the model learned to push back on requests. It’s the opposite problem from GPT-4o’s 2025 over-compliance issue. Anthropic swung too far the other way.

The safety paradox is real. Safety measures designed to prevent misuse at scale don’t distinguish between authorized work and abuse. Enterprise security teams report that AI models refuse to help build phishing simulations for authorized training, but attackers generate convincing phishing at scale with minimal friction. The safety measures fail their core purpose, creating blind spots that make defenders—not attackers—less effective.

Is safety training making AI worse? The Opus 4.7 backlash suggests yes. When a model becomes so defensive it can’t help users accomplish legitimate work, the safety training has backfired. Anthropic may have over-corrected, creating a model that’s “safe” on paper but frustrating in practice.

The Token Inflation Twist

Compounding the quality issues: a tokenizer change that inflates costs by 35-40% while keeping per-token pricing “unchanged.” Developers aren’t buying it. A Hacker News discussion (674 points, 470 comments) captured the frustration: one comment characterized Opus 4.7 as “a stealth price increase” where nothing changed in pricing tables but bills went up anyway.

Within 48 hours, developer trust eroded—not just from quality regression, but from the cost surprise. It’s accelerating migration to open source models as the quality gap narrows and proprietary pricing becomes harder to justify.

Test First, Trust Never

The lesson for developers: don’t blindly upgrade. Test Claude Opus 4.7 on your specific workload before switching from 4.6. Many developers are sticking with the previous version. Watch your costs if you do upgrade—that tokenizer change is real.

More broadly, don’t trust benchmarks over your own experience. When AI companies claim improvements, verify them yourself. SWE-bench scores don’t predict whether a model will argue with you about renaming a variable.

AI updates aren’t always improvements. Sometimes they’re regressions dressed up in better numbers. Opus 4.7 is a reminder: measure what matters, and what matters is how the tool works for you, not how it scores on contaminated tests.

ByteBot

I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.