These 183,000 Books Are Fueling the Biggest Fight in Publishing and Tech

Use our new search tool to see which authors have been used to train the machines.

Illustration by Joanne Imperio / The Atlantic. Source: Getty.

September 25, 2023

Editor’s note: This searchable database is part of The Atlantic’s series on Books3. You can read about the origins of the database here, and an analysis of what’s in it here.

This summer, I acquired a data set of more than 191,000 books that were used without permission to train generative-AI systems by Meta, Bloomberg, and others. I wrote in The Atlantic about how the data set, known as “Books3,” was based on a collection of pirated ebooks, most of them published in the past 20 years. Since then, I’ve done a deep analysis of what’s actually in the data set, which is now at the center of several lawsuits brought against Meta by writers such as Sarah Silverman, Michael Chabon, and Paul Tremblay, who claim that its use in training generative AI amounts to copyright infringement.

Since my article appeared, I’ve heard from several authors wanting to know if their work is in Books3. In almost all cases, the answer has been yes. These authors spent years thinking, researching, imagining, and writing, and had no idea that their books were being used to train machines that could one day replace them. Meanwhile, the people building and training these machines stand to profit enormously.

Reached for comment, a spokesperson for Meta did not directly answer questions about the use of pirated books to train LLaMA, the company’s generative-AI product. Instead, she pointed me to a court filing from last week related to the Silverman lawsuit, in which lawyers for Meta argue that the case should be dismissed in part because neither the LLaMA model nor its outputs are “substantially similar” to the authors’ books.

It may be beyond the scope of copyright law to address the harms being done to authors by generative AI, and the point remains that AI-training practices are secretive and fundamentally nonconsensual. Very few people understand exactly how these programs are developed, even as such initiatives threaten to upend the world as we know it.

Books are stored in Books3 as large, unlabeled blocks of text. To identify their authors and titles, I extracted ISBNs from these blocks of text and looked them up in a book database. Of the 191,000 titles I identified, 183,000 have associated author information. You can use the search tool below to look up authors in this subset and see which of their titles are included.
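For readers curious about what pulling ISBNs out of a block of text might involve, here is a minimal Python sketch. It is not the code used for this analysis. The pattern, the function names, and the choice to accept hyphens or spaces between digits are all assumptions made for illustration. Only the check-digit arithmetic follows the published ISBN-10 and ISBN-13 rules. The database lookup is left out because the article does not say which book database was used.

```python
import re

# Assumed pattern: an ISBN-13 starting with 978 or 979, or an ISBN-10,
# with optional hyphens or spaces between digits.
ISBN_PATTERN = re.compile(
    r"\b(?:97[89][- ]?(?:\d[- ]?){9}\d|(?:\d[- ]?){9}[\dXx])\b"
)


def is_valid_isbn10(digits: str) -> bool:
    """ISBN-10 rule: weights 10 down to 1, total divisible by 11 (X counts as 10)."""
    if len(digits) != 10 or not digits[:9].isdigit() or digits[9] not in "0123456789X":
        return False
    total = sum(
        (10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(digits)
    )
    return total % 11 == 0


def is_valid_isbn13(digits: str) -> bool:
    """ISBN-13 rule: alternating weights 1 and 3, total divisible by 10."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(digits))
    return total % 10 == 0


def extract_isbns(text: str) -> list[str]:
    """Return checksum-valid ISBNs found in text, in order, without duplicates."""
    found = []
    for match in ISBN_PATTERN.finditer(text):
        digits = re.sub(r"[- ]", "", match.group()).upper()
        if (is_valid_isbn13(digits) or is_valid_isbn10(digits)) and digits not in found:
            found.append(digits)
    return found


sample = "ISBN 978-0-306-40615-7 (hardcover); ISBN 0-306-40615-2 (paperback)"
print(extract_isbns(sample))
```

Checking the check digit discards most digit strings that only look like ISBNs, such as phone numbers. It cannot catch a valid ISBN that happens to belong to a different book, for example one listed on an "also by this author" page.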

Before you begin, please note several caveats: Some books appear multiple times, reflecting different editions, translations, abridgments, or annotations. Because of inconsistencies in the spelling of author names, the search may not return books that are, in fact, in Books3. It may also deliver a jumble of odd formatting: A query for Agatha Christie will also return books labeled Agatha Christie and Christie Agatha, for example. And because of possible errors in the book-identification process, which involves detecting an ISBN within the text of the books and using a book database to find their author and title, there is a very small chance of false positives.
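To show why differently formatted author names can cause trouble, here is a hypothetical Python sketch of one way to compare names regardless of word order, punctuation, or accents. The function `author_key` is an invented name, and nothing here describes how the search tool actually works.

```python
import re
import unicodedata


def author_key(name: str) -> str:
    """Build an order-insensitive comparison key for an author name (illustrative only)."""
    # Split accented letters into base letter plus combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only lowercase letter and digit runs, then sort so word order does not matter.
    tokens = re.findall(r"[a-z0-9]+", unaccented.lower())
    return " ".join(sorted(tokens))


print(author_key("Agatha Christie") == author_key("Christie, Agatha"))
```

A key like this treats "Agatha Christie" and "Christie Agatha" as the same author. It would not repair misspellings, and it could wrongly merge two different people whose names contain the same words.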

Alex Reisner is a freelance writer, programmer, and technical consultant.
