One in five peer reviews at ICLR 2026, a major AI research conference, were fully AI-generated. That’s 21% of the 75,800 reviews submitted—15,899 reviews written by algorithms instead of humans. The irony is inescapable: AI researchers, the people building tools to automate knowledge work, couldn’t resist using AI to automate their own peer review process. Then they got caught by AI detection systems. The call is coming from inside the house.
The Scale Is Staggering
The numbers reveal a systemic crisis, not isolated incidents. Pangram Labs analyzed all publicly available ICLR 2026 reviews and found 21% were fully AI-generated. Another 50% showed some AI involvement—editing, assistance, or partial generation. Meanwhile, ICML 2026 took enforcement action, desk-rejecting 497 papers—roughly 2% of all submissions—because their authors violated the conference’s no-LLM policy when reviewing others’ work. The conference detected 795 reviews written by 506 unique reviewers who promised not to use AI, then did anyway.
If 21% were detected, how many evaded detection entirely?
How They Got Caught
ICML embedded hidden watermarks in submission PDFs. If a reviewer fed the PDF to an LLM to generate a review, the watermark instructed the AI to include specific telltale phrases randomly selected from a dictionary of roughly 170,000 options. The system achieved over 80% success rates detecting frontier models in testing. But the conference acknowledged the limitation: watermarking “only catches the most egregious and careless uses” where reviewers directly copy-paste LLM output. Anyone aware of the watermark could easily circumvent it.
Pangram Labs used a different approach—analyzing linguistic and structural patterns across all ICLR reviews to identify AI-generated text. Their methodology, published as a preprint, enabled the discovery of the 21% figure that shocked the community.
What AI Reviews Look Like
AI-generated reviews have distinctive tells. They run significantly longer than human reviews. They use heavy section headers with bold formatting. They pack low information density—verbose but shallow. One extreme example: a 3,000-word review listing 40 weaknesses and 40 questions. Researchers also noticed hallucinated citations referencing papers that don’t exist, unusual requests for non-standard statistical analyses, and excessive bullet points with Markdown formatting preferences.
Verbose doesn’t mean thorough. It means algorithmic.
The Community Is Divided
AI researchers are “starkly divided” on whether LLMs should have any role in peer review. ICML responded with an unprecedented two-track system. Policy A bans all LLM use. Policy B allows limited use—LLMs can help reviewers understand papers and polish text, but not generate substantive evaluations. Authors choose which policy they require. Reviewers declare which policy they’ll follow. The system matches them accordingly. If you demand no-LLM reviews for your work, you must provide no-LLM reviews for others—a reciprocal requirement.
Arguments for AI assistance cite efficiency gains and overwhelming scale—major AI conferences now exceed 10,000 submissions per venue, and the reviewer pool isn’t growing proportionally. Arguments against cite confidentiality violations (uploading unpublished work to LLMs without consent), hallucinations producing fake citations, and an accountability gap when AI makes errors. One academic paper captured the paradox: “Human oversight requires performing the very tasks meant to be outsourced to AI in the first place.”
No consensus exists. The two-track system acknowledges this reality by letting both approaches coexist—for now.
Why This Matters Beyond Conferences
Peer review is the foundation of scientific credibility. If that fails, science fails. But this scandal also signals something broader. AI researchers—the people who build automation tools, who understand their capabilities and limitations better than anyone—couldn’t resist automating their own professional judgment tasks. They got caught, faced consequences, and the community remains divided on whether they did anything wrong.
This is the canary in the coal mine for every profession requiring expert judgment. If AI researchers can’t maintain boundaries with their own tools, what happens in law, medicine, engineering, journalism? The question isn’t whether AI will assist professional work. It’s whether we can draw defensible lines between assistance and replacement, between efficiency and integrity erosion.
The watermarking arms race has already begun. Better AI models will evade current detection. Better detection will emerge. Better evasion will follow. Meanwhile, 21% remains the number everyone remembers—the moment AI research had to confront its own automation.

