
Google’s Aletheia AI Solves 6 Research Math Problems

Google DeepMind’s Aletheia AI agent autonomously solved 6 of 10 research-level mathematics problems in the FirstProof challenge, with expert mathematicians judging the solutions “publishable after minor revisions.” Unlike previous AI math milestones, which tackled competition problems or benchmarks, Aletheia worked on unseen open research questions that had never been posted online, and it produced its solutions completely autonomously, without human hints or intervention during solving. It is the first AI to cross from solving known problems to creating new mathematical knowledge.

FirstProof: Unseen Research Problems, Not Competition Math

The FirstProof challenge, released in February 2026, presented 10 unpublished research-level mathematical problems spanning diverse domains. These weren’t IMO competition problems with known solutions—they were sourced from ongoing mathematical work, never posted online, preventing data contamination. Aletheia solved Problems 2, 5, 7, 8, 9, and 10 fully autonomously within the allowed timeframe, with zero human intervention during solving runs.

Expert evaluators judged the solutions as meeting “the established rigor of mathematical literature” and “publishable after minor revisions.” They were unanimous on five of the six solutions; only Problem 8 drew mixed opinions. For comparison, OpenAI’s unreleased reasoning model initially claimed 6 solutions but revised the count down to 5 after logical flaws were found, and, crucially, it required human supervision to select its outputs. Aletheia operated fully autonomously.

How It Works: Admitting Failure to Prevent Hallucination

Aletheia uses a multi-agent iterative approach powered by Gemini 3 Deep Think. Three specialized components work together: a Generator proposes logical steps and candidate solutions, a Verifier identifies flaws using natural language assessment, and a Reviser patches mistakes. The system loops through generate-verify-revise cycles until it reaches a solution—or admits defeat.

The critical feature: Aletheia can “admit failure.” When unable to solve a problem, it explicitly states “No solution found” rather than hallucinating false proofs. According to DeepMind researchers, reliability is “the primary bottleneck to scaling up AI assistance on research mathematics.” The system also integrates Google Search and web browsing to verify concepts against published literature, preventing spurious citations—a problem that plagues LLMs generating academic work.

For developers, this agentic architecture offers lessons beyond mathematics. The generate-verify-revise loop mirrors how humans debug code, and the “admit failure” principle prevents the false confidence that causes production bugs. It’s a template for building reliable AI systems.
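The loop described above can be sketched in a few lines. This is a minimal illustration under assumptions, not DeepMind’s actual implementation: the `generate`, `verify`, and `revise` callables stand in for the Gemini-powered agents, and the list of flaw strings stands in for the Verifier’s natural-language assessment.

```python
from typing import Callable, Optional

def solve(problem: str,
          generate: Callable[[str], str],
          verify: Callable[[str, str], list[str]],
          revise: Callable[[str, str, list[str]], str],
          max_rounds: int = 5) -> Optional[str]:
    """Generate-verify-revise loop with an explicit 'admit failure' exit.

    A candidate is returned only once the verifier reports no flaws.
    If the round budget runs out, the function returns None instead of
    forcing out an unverified (possibly hallucinated) answer.
    """
    candidate = generate(problem)
    for _ in range(max_rounds):
        flaws = verify(problem, candidate)
        if not flaws:
            return candidate  # verified: accept the solution
        candidate = revise(problem, candidate, flaws)
    return None  # "No solution found": admit failure
```

With toy callables, a fixable draft converges to an accepted answer, while a candidate the verifier never accepts yields `None` rather than a false proof. That asymmetry is the whole design point: the only two exits are “verified” and “admitted failure.”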

The Results: Erdős Problems and Autonomous Papers

Beyond FirstProof, Aletheia tackled Bloom’s Erdős Conjectures database—700+ unsolved problems posed by legendary mathematician Paul Erdős. It autonomously solved 4 open questions: Erdős-652, 654, 1040, and 1051. The Erdős-1051 solution led to a generalization in peer-reviewed research. These are real unsolved problems, not academic exercises—solving even one is a meaningful mathematical contribution.

Most remarkably, Aletheia generated a complete research paper on eigenweights in arithmetic geometry with zero human intervention on the mathematical content, making it the first research paper created autonomously by an AI and marking the crossing from “tool” to “co-author.”

The reality check: Aletheia’s 60% success rate on FirstProof’s 10 curated problems did not carry over to the broader test. Across the 700-plus problems in the Erdős database, only 6.5% of its attempted solutions were meaningfully correct; most attempts still fail. The researchers’ willingness to report both the promise and the limitations lends the results credibility.

Terence Tao: “AI Has Become My Junior Co-Author”

Terence Tao, Fields Medal winner and one of the world’s most prominent mathematicians, calls AI his “junior co-author” and has set up a community wiki to publicly track AI-assisted progress on Erdős problems. He advised on the Aletheia research, providing comments and transparency guidance.

When a Fields Medal winner embraces AI as a collaborator rather than a threat, it validates the technology’s maturity. The “junior co-author” framing is perfect: AI assists but doesn’t replace human mathematicians. Expert validation remains essential—solutions are judged “publishable after minor revisions” but still need peer review. This is the collaboration model for knowledge work, not the replacement narrative.

Key Takeaways

  • First AI to produce publishable mathematical research autonomously—crossing from calculator to researcher
  • 60% success rate on FirstProof (6 of 10 research problems), expert-validated as publishable quality
  • Generator/Verifier/Reviser architecture prevents hallucination by admitting failure instead of forcing false answers
  • Solved 4 open Erdős problems and generated complete research paper with zero human input on math content
  • Significant limitations remain: 6.5% success on broader Erdős set, human validation still essential
  • Terence Tao calls AI “junior co-author”—collaboration model, not replacement
ByteBot