
Sakana AI’s AI Scientist-v2 system generated the first fully autonomous research paper to pass peer review, scoring 6.33/10 at an ICLR 2025 workshop. The system handled everything without human modification: hypothesis generation, experiment design, coding, data analysis, and manuscript writing, at a cost of just $15 per paper. The result, published in Nature in March 2026, demonstrates that AI can complete complex multi-step cognitive tasks that were previously exclusively human domains. But here’s the critical context: this was a workshop paper with a 60-70% acceptance rate, not a main conference submission (20-30% acceptance), and Sakana’s own team admits it wouldn’t meet their bar for top-tier publication.
What the System Actually Did
The AI Scientist-v2 didn’t just write text. It generated the paper “Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization” from scratch. That means proposing the hypothesis, designing experiments to test it, writing the code to run those experiments, analyzing the results, creating visualizations, and writing the manuscript in LaTeX. Researchers gave it a broad research topic and selected which of three generated papers to submit. Everything else was autonomous.
The technical innovation behind this is agentic tree search, specifically a progressive breadth-first tree search. The system explores multiple research directions in parallel, much as human scientists do when investigating a problem. It’s backed by a vision-language model that iteratively refines figures and by an automated reviewer component whose judgments agree with human NeurIPS 2021 peer-review decisions about 69% of the time. The whole process costs about $15 per paper, which is where things get interesting (or concerning, depending on your perspective).
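To make the search pattern concrete, here is a minimal Python sketch of a progressive breadth-first tree search over candidate research directions. It illustrates the general technique only; the node structure, the `propose_children` and `score_node` stubs, and the narrowing heuristic are assumptions for the sake of the example, not Sakana’s implementation.

```python
# Toy progressive breadth-first tree search over research ideas.
# Everything here is a stand-in: `propose_children` would call an LLM agent,
# and `score_node` would run the experiment and an automated reviewer.
from dataclasses import dataclass, field

@dataclass
class Node:
    idea: str                     # a research direction or experiment variant
    score: float = 0.0            # critic/reviewer score for this branch
    children: list = field(default_factory=list)

def propose_children(node: Node, width: int) -> list:
    """Stub: ask an agent for `width` refinements of this idea."""
    return [Node(idea=f"{node.idea} / variant {i}") for i in range(width)]

def score_node(node: Node) -> float:
    """Stub: run the experiment and score the result."""
    return (hash(node.idea) % 100) / 100.0

def tree_search(root: Node, depth: int, width: int, keep: int) -> Node:
    """Expand levels breadth-first, keeping only the top `keep` branches per level."""
    frontier, best = [root], root
    for _ in range(depth):
        level = []
        for node in frontier:
            node.children = propose_children(node, width)
            for child in node.children:
                child.score = score_node(child)
                if child.score > best.score:
                    best = child
            level.extend(node.children)
        # "Progressive" narrowing: only the most promising branches survive.
        frontier = sorted(level, key=lambda n: n.score, reverse=True)[:keep]
    return best

print(tree_search(Node("compositional regularization"), depth=3, width=4, keep=2).idea)
```

The narrowing step is the point of the sketch: weak branches are pruned at each level, so breadth stays bounded while the search deepens, which is how this kind of system keeps compute per paper in check.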
The Reality Check No One Wants to Hear
Workshop papers and main conference papers aren’t the same. Workshop acceptance rates hover around 60-70%. Main conference tracks? 20-30%. That’s a substantial difference in selectivity. The AI Scientist paper scored 6, 7, and 6 from its three reviewers—an average of 6.33, landing it around the 45th percentile. It passed, but barely.
More importantly, Sakana AI themselves said none of their three submitted papers “passed our internal bar for what we believe would qualify as accepted ICLR conference track paper.” The accepted paper made citation errors, including misattributing LSTM development to the wrong researchers. The system struggles with “deep methodological rigor and complex code implementation” and produces “occasional naive or underdeveloped ideas.” The paper was withdrawn before publication, though that was planned from the start for transparency reasons.
This isn’t “AI can do science as well as humans.” This is “AI can produce work that passes a workshop-level bar, with known limitations and errors that humans catch on review.”
Why Developers Should Care
Strip away the academic context and look at what this system demonstrates: end-to-end automation of complex cognitive workflows. The parallels to software development are obvious. Automated testing follows the same pipeline—hypothesis about what might break, test design, code generation, execution, analysis, reporting. Documentation generation works similarly. Bug analysis, performance optimization, security auditing—all multi-step reasoning tasks that mirror what the AI Scientist does for research.
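As a sketch of what that pipeline looks like in code, here is a hedged, minimal version of the hypothesis-test-report loop. The `ask_llm` function is a hypothetical placeholder for whatever model API you use (in this sketch it returns a canned test), and the orchestration is illustrative rather than any particular tool’s design.

```python
# Minimal sketch of the automated-testing pipeline described above:
# hypothesis -> test design -> code generation -> execution -> report.
# `ask_llm` is a hypothetical placeholder, not a real vendor API.
import subprocess
import sys
import tempfile
import textwrap

def ask_llm(prompt: str) -> str:
    """Placeholder for a model call; returns a canned, self-running test here."""
    return textwrap.dedent("""
        def test_divide_by_zero():
            try:
                1 / 0
            except ZeroDivisionError:
                pass
            else:
                raise AssertionError("expected ZeroDivisionError")

        test_divide_by_zero()
    """)

def run_pipeline(module_under_test: str) -> dict:
    # 1. Hypothesis: ask the model what is most likely to break.
    hypothesis = ask_llm(f"What edge cases could break {module_under_test}?")
    # 2-3. Test design and code generation.
    test_code = ask_llm(f"Write tests covering: {hypothesis}")
    # 4. Execution: run the generated tests in a throwaway script.
    with tempfile.NamedTemporaryFile("w", suffix="_test.py", delete=False) as f:
        f.write(test_code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True)
    # 5. Analysis and reporting.
    return {"passed": result.returncode == 0, "stderr": result.stderr}

print(run_pipeline("my_math_module"))
```

The stub ignores its prompt, so the example is deterministic; swapping `ask_llm` for a real model call is where the actual engineering begins (sandboxing, validating generated code, handling flaky runs). The shape is what matters: each stage consumes the previous stage’s output, the same structure the AI Scientist applies to experiments instead of tests.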
The key insight from the Nature paper is that “as the underlying foundation models improve, the quality of the generated papers increases correspondingly.” This isn’t a ceiling—it’s a floor. GPT-5 or Claude Opus 5 won’t just write better papers; they’ll write better tests, better documentation, better analysis. Berkeley Lab is already using similar AI systems for experimental design and analysis, freeing researchers to focus on discovery rather than execution.
For literature reviews, LLMs can synthesize thousands of research papers where humans cap out at hundreds. For code review, they can check thousands of edge cases where humans miss obvious bugs. The question isn’t whether AI can match human performance—it’s how quickly it surpasses the average.
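One hedged sketch of how a context-limited model can “read” thousands of papers is a map-reduce pass over abstracts: summarize in batches, then summarize the summaries. The `summarize` function below is a stub standing in for a model call; nothing here reflects a specific system.

```python
# Map-reduce sketch for synthesizing many abstracts with a context-limited model.
# `summarize` is a stub for an LLM call; here it just truncates the joined text.
def summarize(texts: list) -> str:
    """Stand-in for a model call that condenses several texts into one summary."""
    return " ".join(texts)[:200]

def synthesize(abstracts: list, batch_size: int = 50) -> str:
    """Map: summarize each batch of abstracts. Reduce: merge until one summary remains."""
    summaries = [
        summarize(abstracts[i:i + batch_size])
        for i in range(0, len(abstracts), batch_size)
    ]
    while len(summaries) > 1:
        summaries = [
            summarize(summaries[i:i + batch_size])
            for i in range(0, len(summaries), batch_size)
        ]
    return summaries[0]

print(synthesize([f"Abstract {i}: a toy finding." for i in range(2000)])[:80])
```

The same batching pattern fits the code-review case: edge cases get checked in window-sized chunks rather than all at once.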
The Ethics Debate Gets Messy
The Committee on Publication Ethics is clear: AI can’t be an author. It can’t initiate original research and can’t be held accountable for published work. Among medical journals, 59% outright prohibit AI authorship and 32% allow it with restrictions. But when AI generates the entire paper autonomously, who gets credit? Who’s responsible when it hallucinates citations or produces statistically weak findings dressed up as discovery?
At $15 per paper, the system could flood peer review with volume. Nature’s editorial warns of “straining peer review, inflating credentials, borrowing ideas without credit.” The ease of generation could skew science toward data-rich computational domains while reducing diversity in research approaches. Early-career researchers might find fewer opportunities if AI handles the grunt work of experimentation.
Then there’s trust. AI produces convincing nonsense—fabricated citations, cherry-picked results, conclusions that sound rigorous but aren’t. Human verification remains essential, but as AI quality improves, detecting errors gets harder.
What Comes Next
Google, OpenAI, and Anthropic are all testing AI research assistants, though their outputs remain limited. More AI-generated papers will hit workshops and conferences. Publishers are scrambling to develop authorship policies. The likely outcome isn’t full automation but hybrid human-AI research, where humans direct and AI executes.
The open question: when AI surpasses the average human researcher, what happens to the profession? For developers, the same question applies. When AI handles end-to-end problem solving—from requirement analysis to implementation to testing to documentation—the role shifts from doing to directing. That’s coming faster than most expect.