Pulitzer Prize finalist John Carreyrou and five other authors filed lawsuits on December 23, 2025, against Anthropic, Google, OpenAI, Meta, xAI, and Perplexity, claiming the AI companies trained their models on millions of pirated books downloaded from shadow libraries. The filings join more than 50 copyright lawsuits brought against AI companies in 2025 – part of what TechCrunch called the year “AI got a vibe check.” Each author is seeking $150,000 per work from each defendant under the Copyright Act, having opted out of Anthropic’s $1.5 billion class settlement, which would have paid just $3,000 per book. The central question: when does “transformative innovation” cross the line into mass theft?
The Piracy Paradox
The allegations read like a script from the Napster era, except the teenagers downloading pirated content are now billion-dollar AI companies. According to the lawsuits, AI companies downloaded books from shadow libraries including LibGen (Library Genesis), Z-Library, and Anna’s Archive to train their models. The scale is staggering: Meta alone downloaded 35.7 terabytes of pirated data – over 31 million books. The infamous Books3 dataset, which contained 191,000+ pirated books, was used to train Meta’s Llama models. Meta admitted in court to using this dataset, though it claims fair use protections apply.
OpenAI’s models tell a similar story. GPT-1 trained on BookCorpus, a collection of 7,000+ unpublished titles scraped from Smashwords. For GPT-3, roughly 16 percent of the training data came from “Books1” and “Books2,” two internet-based book corpora whose contents OpenAI has never disclosed. DeepSeek’s Vision-Language model pulled data from Anna’s Archive. Nvidia faced lawsuits over training its NeMo platform on the Books3 dataset.
The pattern is consistent: download pirated books, train AI models, build billion-dollar products, and pay authors nothing. Unsealed emails in the Meta case revealed that employees knowingly used shadow libraries. These are the same companies that enforce strict API terms of service and take legal action when others scrape their platforms without permission.
Legal Uncertainty: Three Judges, Three Different Answers
The most concerning aspect of this legal battle is that nobody – not even federal judges – agrees on what’s actually legal. Three major 2025 rulings reached three different conclusions.
In Bartz v. Anthropic, Judge William Alsup ruled that using copyrighted materials to train LLMs was “transformative—spectacularly so” and constitutes fair use. But he drew a sharp line: training on legally acquired books can be fair use, while acquiring pirated books is “inherently, irredeemably infringing” regardless of how they’re ultimately used. Anthropic settled for $1.5 billion, paying authors an average of roughly $3,000 per book.
Judge Vince Chhabria reached the opposite conclusion in Kadrey v. Meta. He ruled that training an LLM on copyrighted books – including books obtained from shadow libraries – was fair use, though he stressed that his decision turned on the plaintiffs’ failure to show market harm rather than a blanket blessing of AI training. Unlike Judge Alsup, Chhabria declined to treat the shadow library downloads as a distinct act of piracy.
A third ruling in Thomson Reuters v. Ross Intelligence added another wrinkle. Judge Stephanos Bibas rejected the fair use defense, finding that Ross’s use was commercial and not transformative because it created a competing legal research product. On that reasoning, using training data to build a product that competes with the original source doesn’t qualify as transformative use.
The U.S. Copyright Office weighed in, rejecting arguments that AI training is “inherently transformative” or “analogous to human learning” – calling both claims “mistaken.” The conflicting rulings suggest this issue may require Supreme Court intervention to resolve.
Why Authors Are Suing Individually
The math explains why Carreyrou and others opted out of the Anthropic class action settlement. Under that settlement, authors would receive approximately $3,000 per book – just 2 percent of the Copyright Act’s statutory ceiling of $150,000. By suing individually, each author can seek $150,000 per work from each of the six defendants, potentially totaling $900,000 per work – a 300x multiplier over the class action payout, as the back-of-the-envelope sketch below shows.
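To make the incentive concrete, here is a minimal sketch of that arithmetic in Python, using only the figures reported above (the $150,000 statutory ceiling, six defendants, and the ~$3,000-per-book class payout); the constant and variable names are illustrative, not taken from any court filing:

```python
# Back-of-the-envelope comparison of the two paths open to an author,
# using figures reported in the lawsuits (all amounts in US dollars).

STATUTORY_MAX_PER_WORK = 150_000   # Copyright Act ceiling for willful infringement
NUM_DEFENDANTS = 6                 # Anthropic, Google, OpenAI, Meta, xAI, Perplexity
CLASS_PAYOUT_PER_BOOK = 3_000      # approximate Anthropic class settlement per book

# Ceiling if an author sues every defendant individually over one work.
individual_ceiling = STATUTORY_MAX_PER_WORK * NUM_DEFENDANTS        # 900,000

# Class payout as a share of the statutory maximum for a single defendant.
settlement_share = CLASS_PAYOUT_PER_BOOK / STATUTORY_MAX_PER_WORK   # 0.02 -> 2%

# Multiplier an individual suit could yield over the class settlement.
multiplier = individual_ceiling / CLASS_PAYOUT_PER_BOOK             # 300.0

print(f"Per-work ceiling across all defendants: ${individual_ceiling:,}")
print(f"Class payout as share of statutory max: {settlement_share:.0%}")
print(f"Individual ceiling vs. class payout: {multiplier:.0f}x")
```

The numbers are ceilings, not expected awards – statutory damages for willful infringement range down to far smaller sums, and each claim would have to survive the fair use defenses described above.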
The plaintiffs include John Carreyrou (Bad Blood), Lisa Barretta (11 books on spirituality), Philip Shishkin (Restless Valley), Jane Adams (8 nonfiction books on family relationships), Matthew Sack (Pro Website Development and Operations), and Michael Kochin (political science professor). Their complaint argues that class settlements allow tech firms to “resolve mass infringement claims too cheaply.”
This lawsuit also marks the first copyright action against Elon Musk’s xAI over its training process and the first suit brought by authors against Perplexity AI. Neither company has a settlement precedent, which may make them more lucrative targets for individual suits. If these authors win, expect a flood of individual lawsuits from other authors who see better returns fighting alone than accepting group settlements.
What This Means for Developers
The tools developers use daily were potentially built on pirated content. ChatGPT, Claude, GitHub Copilot, Gemini, Llama, and Grok are all built by – or on models from – companies that are either defendants in this lawsuit or have admitted using shadow library data. Anthropic’s $1.5 billion settlement already established that Claude was trained on pirated books.
If AI companies lose these lawsuits decisively, the consequences cascade. Retraining models on fully licensed data would cost billions. Those costs get passed to enterprise customers and individual developers through higher subscription prices. Some features might disappear entirely if the underlying training data can’t be legally justified. Model capabilities could degrade if companies must excise knowledge derived from unauthorized sources.
The emerging $2.5 billion licensing market for training data suggests AI companies know their fair use defense is shaky. OpenAI, Google, and others have signed multimillion-dollar licensing deals with Reddit, the Wall Street Journal, and other content providers. They’re hedging their bets – arguing fair use in court while simultaneously paying for licenses.
There’s also a hypocrisy worth noting. Many developers who loudly oppose companies scraping GitHub without permission are now defending AI companies scraping books without permission. The principle should be consistent: either respect intellectual property rights or don’t, but you can’t have it both ways depending on whose ox is being gored.
The 2025 AI Vibe Check
This lawsuit is part of a broader 2025 trend where AI companies face reality checks despite astronomical valuations. OpenAI raised $40 billion at a $300 billion valuation in the same year that 50+ copyright lawsuits were filed against AI companies. Reports of “AI psychosis” sparked calls for trust and safety reforms. The narrative shifted from “AI will change everything” in early 2025 to a more measured “maybe we should figure out the legal and ethical frameworks first” by year’s end.
The industry’s actions reveal its true beliefs. If AI companies genuinely believed training on pirated content was unquestionably fair use, they wouldn’t be spending billions licensing data. The fact that a $2.5 billion licensing market exists while companies argue in court that they don’t need licenses tells you everything. They’re not confident in their legal position.
Whether AI training qualifies as fair use remains genuinely unsettled. What’s clear is that obtaining books through piracy doesn’t become legal just because the end use might be transformative. You can’t claim an innovation exception if you stole the input. Federal judges can’t agree on where to draw the line, which means developers and AI companies are operating in a legal gray zone that could collapse at any moment. Until the Supreme Court or Congress provides clarity, every AI model trained on shadow library content carries legal risk – and that risk now has a price tag starting at $150,000 per work per defendant.