How copyright lawsuits could kill OpenAI
The New York Times v. OpenAI, explained.
If you’re old enough to remember watching the hit kids’ show Animaniacs, you probably remember Napster, too. The peer-to-peer file-sharing site, which made it easy to download music for free in an era before Spotify and Apple Music, took college campuses by storm in the late 1990s. This did not escape the notice of the record companies, and in 2001, a federal court ruled that Napster was liable for copyright infringement. The content producers fought back against the technology platform and won.
But that was 2001 — before the iPhone, before YouTube, and before generative AI. This generation’s big copyright battle is pitting journalists against artificially intelligent software that has learned from and can regurgitate their reporting.
Late last year, the New York Times sued OpenAI and Microsoft, alleging that the companies are stealing its copyrighted content to train their large language models and then profiting off of it. In a point-by-point rebuttal to the lawsuit’s accusations, OpenAI claimed no wrongdoing. Meanwhile, the Senate Judiciary Subcommittee on Privacy, Technology, and Law held a hearing in which news executives implored lawmakers to force AI companies to pay publishers for using their content.
Depending on who you ask, what’s at stake is either the future of the news business, the future of copyright law, the future of innovation, or, specifically, the future of OpenAI and other generative AI companies. Or all of the above.
Ideally, Congress would step in to settle the debate, but as James Grimmelmann, a professor of digital and information law at Cornell Law School, told me: “Congress does not like to legislate on copyright unless there’s a consensus of most of the players in the room — and there’s not anything resembling that consensus right now. So Congress may hold hearings and talk about it, but we’re really far from any legislative action.”
So which is it? Advocates of technological innovation would say that AI technology is full of promise and we’d better not stifle that while it’s in the early days of development. Media companies would say that even exciting technology companies need to pay when they use copyrighted content, and if we give AI a free pass, journalism as we know it could eventually cease to exist.
The consensus of casual observers and legal experts alike is that this New York Times lawsuit is a big deal. Not only does the Times appear to have a solid case, but OpenAI has a lot to lose — perhaps its very existence.
The case against OpenAI, briefly explained
If you ask ChatGPT a question about, say, the fall of the Berlin Wall, there’s a good chance some of the information in the answer has been culled from New York Times articles. That’s because the large language model, or LLM, that powers ChatGPT has been trained on over 500 gigabytes of data, including newspaper archives. Generative AI tools only work because this training data helps them know how to effectively respond to prompts. In other words, copyrighted data, in part, is what makes this new technology powerful and what makes OpenAI such a valuable company.
The New York Times claims that OpenAI trained its model with copyrighted Times content and did not pay proper licensing fees. That, the lawsuit says, enables OpenAI to “compete with and closely mimic” the New York Times, perhaps by summing up a news story based on Times reporting or summing up a product recommendation based on Wirecutter reviews.
Even worse is what the lawsuit calls “regurgitation,” which is when OpenAI spits out text that matches Times articles verbatim. The Times provides 100 examples of such “regurgitation” in the lawsuit. In its rebuttal, OpenAI said that regurgitation is a “rare bug” that the company is “working to drive to zero.” It also claims that the Times “intentionally manipulated prompts” to get this to happen and “cherry-picked their examples from many attempts.”
But at the end of the day, the New York Times argues that OpenAI is making money off of Times content and costing the newspaper “billions of dollars in statutory and actual damages.” By one estimate, given the millions of articles potentially implicated and the cost per instance of copying, the New York Times might be looking for $450 billion in damages.
OpenAI has a clear solution to this conflict: Pay the copyright owners upfront. The company has already announced licensing deals with folks like the Associated Press and Axel Springer. OpenAI also claims that it was negotiating a deal with the New York Times right before the newspaper filed its lawsuit.
Just how much OpenAI is willing to pay news outlets is unclear. A January 4 report in the Information said that OpenAI has offered some media firms “as little as between $1 million and $5 million to license their articles for use in training its large language models,” which seems like a small amount of money to OpenAI, currently aiming for a valuation as high as $100 billion. But the mounting lawsuits, should they go against the company, could be far more expensive than paying heftier licensing fees.
The New York Times is also not the only party suing OpenAI and other tech companies over copyright infringement. A growing list of authors and entertainers has been filing lawsuits since ChatGPT made its splashy debut in the fall of 2022, accusing these companies of copying their works in order to train their models. The copyright holders filing these lawsuits extend well beyond writers, too. Developers have sued OpenAI and Microsoft for allegedly stealing software code, while Getty Images is embroiled in a lawsuit against Stability AI, the maker of the image-generating model Stable Diffusion, over its copyrighted photos.
“When you’re talking about copyright and you get statutory damages,” said Corynne McSherry, legal director at the Electronic Frontier Foundation, “if you lose, the downside and the financial risk is massive.”
The case for innovation
While it’s easy to compare the Times case to the Napster one, the better precedent involves the VCR, according to McSherry.
In 1984, a years-long copyright case between Sony and Universal Studios over the practice of using VCRs to record TV shows made it all the way to the United States Supreme Court. The studio alleged that Sony’s Betamax video tape recorders could be used for copyright infringement, while Sony’s lawyers argued that taping shows was fair use, which is the doctrine that allows copyrighted material to be reused without permission or payment.
Sony won. The court’s decision, which has never been overturned, said that if machines, including the VCR, have substantial non-infringing uses, then the company that makes them can’t be held liable if customers use them to infringe upon copyrights.
The entertainment industry was forever changed by this case. The VCR let people watch whatever was broadcast on TV whenever they wanted, and within just a few years, Hollywood studios actually saw their profits grow. The machine got people more excited about watching movies, and they watched more of them, both at home and in theaters.
“If you have to go to copyright owners for permission for technological innovation, you’re going to get a lot less innovation,” McSherry told Vox.
With that in mind, there’s one more copyright lawsuit worth looking at: the Google Books case. In 2004, Google started scanning books, including copyrighted works, so that “snippets” of their text would show up in search results. It partnered with libraries at places like Harvard, Stanford, and the University of Michigan, as well as magazines, like New York Magazine and Popular Mechanics, that wanted their archives digitized.
Then came the lawsuits, including a 2005 class action suit from the Authors Guild. The authors cried copyright infringement, and Google claimed that making books searchable amounted to fair use. As Judge Denny Chin said in a 2013 decision dismissing the authors’ lawsuit, Google Books is transformative because, thanks to the tool, “words in books are being used in a way they have not been used before.” It took about a decade, but Google eventually won, and Google Books is now legal.
Like Sony and Napster before it, the Google Books case is ultimately about the battle between new technology platforms and copyright holders. It also raises the question of innovation. Is it possible that giving copyright holders too much power could stifle technological progress?
In that 2013 decision, Judge Chin said Google’s technology “advances the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders.” And a 2023 economics study of the effects of Google Books found that “digitization significantly boosts the demand for physical versions” and “allows independent publishers to introduce new editions for existing books, further increasing sales.” So consider that another point in favor of giving tech platforms room to innovate.
Few would disagree that technological progress has shaped the media business since the invention of the printing press. That’s basically why the earliest copyright laws were written over 300 years ago: Technology made copying easier, and authors needed some way to protect their intellectual property.
But AI is a bigger leap forward, technologically speaking, than the VCR, Napster, and Google Books combined. It is too early to know for certain, but AI seems destined to transform our understanding of copyright and how content creators get paid for their work. It will take a while, too. A ruling in the New York Times’s case against OpenAI will take years, and even then, questions will remain.
“I think generative AI could be as transformational for copyright as the printing press,” said Grimmelmann, the Cornell law professor. “But that will probably take a little bit longer to play out.”
A version of this story was also published in the Vox Technology newsletter.