On January 7, 2026, Cambridge student AcerFur announced that OpenAI’s GPT-5.2 had “autonomously” solved Erdős problem #728, potentially making it the first AI to crack an open mathematical conjecture before any human mathematician. The claim hit the Hacker News front page with 326 points, sparking celebration across tech circles. But three critical caveats buried in the announcement thread tell a far more complicated story.
The original problem statement was ambiguous, and the AI solved the mathematical community’s interpretation of what Erdős “probably meant”—not the original problem. The solution heavily relies on mathematician Carl Pomerance’s 1996 work. GPT-5.2 Pro was used to “format” the proof, possibly validating and filling gaps along the way. This is a perfect case study in AI hype versus reality.
The Fine Print Nobody Reads
AcerFur listed three caveats in the announcement. First, the original Erdős problem #728 is “quite ambiguous,” so the AI solved an interpretation. Second, the solution uses “arguments similar to those from Pomerance (2015)” [sic: actually 1996], meaning it is heavily based on existing techniques. Third, GPT-5.2 Pro formatted the proof into LaTeX, “possibly validating and filling in gaps during the formatting process.”
These aren’t minor footnotes. They fundamentally undermine the “autonomous AI solving” narrative. If the problem statement needed human interpretation, the AI didn’t autonomously navigate ambiguity. If the solution recycles Pomerance’s techniques from 30 years ago, the AI didn’t discover new mathematics. If GPT-5.2 Pro filled gaps during “formatting,” that’s not formatting—that’s fixing the proof.
“AI solves math problem” is a very different claim from “AI applied existing techniques to solve an interpreted version of an ambiguous problem with multi-stage AI assistance and human guidance.” The gap between those two statements is where AI hype lives.
This Is Not The First Time
In October 2025, OpenAI researcher Sebastien Bubeck claimed GPT-5 had solved Erdős problems. The claim received 100,000 views before it unraveled. Mathematicians discovered GPT-5 hadn’t solved anything—it had crawled the web for existing solutions and retrieved them. Literature search, not mathematical discovery.
Thomas Bloom, who maintains the Erdős problems website, clarified: “GPT-5 found references, which solved these problems, that I personally was unaware of.” Bubeck deleted his tweet and backtracked, though Gary Marcus observed, “I don’t know anybody who believes his retrenchment.” DeepMind CEO Demis Hassabis called the incident “embarrassing.”
The GPT-5.2 incident follows the same pattern: breathless initial claim, expert scrutiny, emerging caveats, and a far less impressive reality. The difference is that this time the caveats were disclosed upfront (credit to AcerFur for transparency), but the headlines still ran with “AI solves math problem.”
The “Low-Hanging Fruit” Reality
Fields Medalist Terence Tao provides critical context: “AI tools are now capable enough to pick off the lowest hanging fruit amongst the problems listed as open in the Erdős problem database, where by ‘lowest hanging’ I mean ‘amenable to simple proofs using fairly standard techniques.’” This is not AI discovering profound mathematics. This is pattern matching applied to accessible problems.
In November 2025, Harmonic’s Aristotle AI solved Erdős #124. Thomas Bloom noted it was “the easier of two variants posed by Erdős” and the solution “turned out to be relatively straightforward in hindsight, comparable in difficulty to mathematical competition problems where AI has already demonstrated strong performance.”
The numbers back this up. GPT-5.2 scores 77% on FrontierScience-Olympiad, which tests competition-level math, but only 25% on open-ended Research tasks requiring genuine mathematical insight. That 52-point gap explains everything: AI can handle structured problems with standard techniques but struggles with genuine research.
AI is a tool for accelerating routine tasks—literature search, proof formalization, checking—not a replacement for mathematical insight. Understanding this distinction prevents overestimating AI capabilities in your own work.
What “Autonomous” Really Means
The GPT-5.2 solution involved a multi-stage AI pipeline. GPT-5.2 generated the initial proof. Harmonic’s Aristotle system verified and “autonomously repaired” it. GPT-5.2 Pro formatted the proof, with possible gap-filling. Before any of that, human mathematicians interpreted the ambiguous problem statement, and AcerFur coordinated the verification effort throughout.
Headlines: “GPT-5.2 autonomously solves.” Reality: Multi-model pipeline with human coordination at every stage. This mirrors a broader trend in AI—multi-stage systems with human oversight marketed as “autonomous.” Similarly, “AI coding agents” involve IDE plugins, LLM calls, linters, and human review but get labeled “autonomous coding.”
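To see what that label hides, here is a minimal sketch of the kind of pipeline described above, in Python. Every name and stub is illustrative, not the actual tooling; the point is to count the model stages and human checkpoints sitting behind a single-model headline.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the "autonomous" proof pipeline described
# above. All stages are stubs; none of this is the real tooling.

@dataclass
class ProofAttempt:
    problem: str
    proof: str = ""
    log: list[str] = field(default_factory=list)

def human_interpret(statement: str) -> str:
    # Human checkpoint: mathematicians decide what Erdős "probably meant"
    # before any model sees the problem.
    return statement + " (community interpretation)"

def draft_proof(attempt: ProofAttempt) -> ProofAttempt:
    # Model stage 1: a general-purpose model drafts the proof.
    attempt.proof = "draft proof of: " + attempt.problem
    attempt.log.append("GPT-5.2: generated initial proof")
    return attempt

def verify_and_repair(attempt: ProofAttempt) -> ProofAttempt:
    # Model stage 2: a verifier (Aristotle, per the announcement)
    # checks the draft and "autonomously repairs" failures.
    attempt.log.append("Aristotle: verified and repaired proof")
    return attempt

def format_proof(attempt: ProofAttempt) -> ProofAttempt:
    # Model stage 3: another model converts the proof to LaTeX and,
    # per the caveats, possibly fills gaps along the way.
    attempt.proof = "\\begin{proof} " + attempt.proof + " \\end{proof}"
    attempt.log.append("GPT-5.2 Pro: formatted (and possibly patched) proof")
    return attempt

def human_coordinate(attempt: ProofAttempt) -> ProofAttempt:
    # Human checkpoint: a person coordinates verification and decides
    # the result is worth announcing.
    attempt.log.append("human: coordinated verification, announced result")
    return attempt

attempt = ProofAttempt(problem=human_interpret("Erdős #728 (ambiguous)"))
for stage in (draft_proof, verify_and_repair, format_proof, human_coordinate):
    attempt = stage(attempt)
print("\n".join(attempt.log))
```

Five steps, two of them human. “Autonomous” is doing a lot of work.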
When evaluating AI tools, ask: How many models are involved? Where does human intervention happen? What does “autonomous” actually mean here? The answers matter for assessing true capability and cost.
Don’t Trust The Headlines
Carl Sagan’s maxim applies here: “Extraordinary claims require extraordinary evidence.” When you see “AI breakthrough” headlines, apply this framework. First, what exactly was solved? Was it the original problem or an interpretation? Second, how much human involvement occurred? Interpretation, coordination, validation? Third, is the solution novel or derivative? New techniques or existing methods? Fourth, is this autonomous or multi-stage? Single AI or pipeline with human checkpoints?
Apply this to GPT-5.2: Solved an interpretation of an ambiguous problem, not the original. Heavy human involvement—interpretation guidance, verification coordination. Derivative—Pomerance 1996 techniques. Multi-stage pipeline—GPT-5.2, Aristotle, GPT-5.2 Pro, human coordination. Result: Interesting technical achievement, not the breakthrough headlines suggest.
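For readers who think in code, the four-question framework fits in a short sketch. The field names and red-flag strings below are my own illustration, not an established rubric; the inputs simply encode the assessment above.

```python
from dataclasses import dataclass

@dataclass
class BreakthroughClaim:
    solved_original_problem: bool   # or only an interpretation?
    human_involvement: str          # e.g. "none", "coordination", "every stage"
    novel_techniques: bool          # or derivative of existing work?
    single_autonomous_system: bool  # or a multi-stage pipeline?

def hype_flags(claim: BreakthroughClaim) -> list[str]:
    """List the red flags a headline probably glossed over."""
    flags = []
    if not claim.solved_original_problem:
        flags.append("solved an interpretation, not the original problem")
    if claim.human_involvement != "none":
        flags.append("human involvement: " + claim.human_involvement)
    if not claim.novel_techniques:
        flags.append("derivative of existing techniques")
    if not claim.single_autonomous_system:
        flags.append("multi-stage pipeline, not one autonomous system")
    return flags

# The GPT-5.2 / Erdős #728 claim, scored per this article's assessment:
gpt52 = BreakthroughClaim(
    solved_original_problem=False,    # ambiguous statement, interpreted
    human_involvement="every stage",  # interpretation, coordination
    novel_techniques=False,           # Pomerance (1996) techniques
    single_autonomous_system=False,   # GPT-5.2 + Aristotle + GPT-5.2 Pro
)
for flag in hype_flags(gpt52):
    print("-", flag)
```

All four flags fire, which is exactly the conclusion above: an interesting technical achievement, not an autonomous breakthrough.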
Developers need critical thinking skills for AI claims. Vendors will oversell. Your job is to see through marketing and assess actual capability. Additionally, pattern recognition matters. Check if this happened before. Spoiler: It has, repeatedly, and the pattern is always the same—initial hype, expert correction, quiet walkback.
Key Takeaways
- Read the caveats, not just the headlines—AcerFur disclosed three critical limitations upfront, but most coverage ignored them
- Pattern matters: October 2025’s GPT-5 “Erdősgate” (literature search marketed as solving) follows the same hype → correction → walkback cycle
- AI excels at structured, competition-level math (77% on Olympiad tasks) but struggles with genuine research requiring novel insights (25% on Research tasks)
- “Autonomous” often means “multi-stage pipeline with human oversight at every checkpoint”—GPT-5.2 required Aristotle, GPT-5.2 Pro, human interpretation, and verification coordination
- Critical thinking beats hype every time: Apply the four-question framework (What? How much human? Novel? Autonomous?) to every AI breakthrough claim
The gap between AI’s actual capabilities and breathless headlines isn’t closing. If anything, it’s widening as vendors compete for attention. The antidote is simple: Read the fine print, check the pattern, and trust expert assessment over marketing claims.