Generative AI’s Illusory Case for Fair Use
Jacqueline C. Charlesworth | 27 Vand. J. Ent. & Tech. L. 323 (2025)
Pointing to Authors Guild, Inc. v. Google Inc., Authors Guild, Inc. v. HathiTrust, Sega Enterprises Ltd. v. Accolade, Inc., and other leading technology-driven fair use precedents, artificial intelligence (AI) companies and those who advocate for their interests claim that the mass unauthorized reproduction of books, music, photographs, visual art, news articles, and other copyrighted works to train generative AI systems is a fair use of those works. Though acknowledging that works are copied without permission for training, proponents of fair use maintain that an AI machine learns only uncopyrightable information about the works during that process. Once trained, they say, the model neither incorporates nor makes use of the content of the training works. As such, they contend, copying for purposes of AI training is a fair use under US law.
This Article challenges the above narrative by examining how generative AI systems are trained and how they function. Despite the widespread use of anthropomorphic terms to describe their behavior, AI machines do not learn or reason as humans do. Instead, they employ an algorithmic process to store the works they are fed during training. They do not “know” anything independently of the works on which they are trained, so their output is a function of the copied materials.
More specifically, large language models (LLMs) are trained by breaking textual works down into small segments, or “tokens” (typically individual words or parts of words), and converting the tokens into vectors, numerical representations of each token and of where it appears in relation to other tokens in the text. The training works do not vanish, as fair use proponents suggest, but instead are encoded, token by token, into the model and relied upon to generate output. AI image generators are trained somewhat differently, through a “diffusion” process in which they learn to reconstruct particular training images in conjunction with associated descriptive text. Like an LLM, however, an AI image generator relies on encoded representations of training works to generate its output.
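To make the mechanics concrete, the following Python sketch illustrates tokenization and vector encoding in deliberately simplified form. It is not the training procedure of any actual model: the word-level tokenizer, four-dimensional vectors, and random values are illustrative stand-ins for the learned subword vocabularies and high-dimensional learned embeddings that production LLMs use.

```python
# Illustrative sketch only: a toy tokenizer and embedding lookup showing,
# in simplified form, how text is broken into tokens and converted into
# vectors. Real models use learned subword vocabularies and learned,
# high-dimensional embeddings; the values here are random placeholders.
import random

random.seed(0)

text = "The quick brown fox jumps over the lazy dog"
tokens = text.lower().split()  # toy tokenization: one token per word

# Build a vocabulary mapping each distinct token to an integer ID.
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

EMBED_DIM = 4  # real models use hundreds or thousands of dimensions

# Assign each token ID a vector. In a real model these values are
# adjusted during training; here they are fixed random stand-ins.
embeddings = {i: [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
              for i in vocab.values()}

for position, tok in enumerate(tokens):
    tok_id = vocab[tok]
    vector = embeddings[tok_id]
    # Each token is represented by its position, its ID, and its vector,
    # the numerical encoding on which the model operates.
    print(position, tok, tok_id, [round(v, 2) for v in vector])
```

Even in this toy form, the Article’s point is visible: the text of the training work is the sole source of the tokens, IDs, and vectors the model encodes.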
The exploitation of expressive content to produce new expressive content sharply distinguishes AI copying from the copying at issue in the technological fair use cases relied upon by AI’s fair use advocates. In those earlier cases, the determination of fair use turned on the fact that the alleged infringer was not seeking to capitalize on the authors’ creative expression. Generative AI does exactly the opposite.
The fair use argument for generative AI is further hampered by the propensity of models to generate infringing copies and derivatives of training works. In addition, some AI models rely on retrieval-augmented generation (RAG) technology to generate output. RAG searches out and copies materials from online sources to augment the model’s responses to user queries (for example, queries regarding an event that postdates the training of the LLM). Here again, generative AI copies copyrighted materials without permission in order to exploit their expressive content.
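A minimal sketch of the RAG loop, assuming a hypothetical corpus and a naive keyword-overlap ranking, shows where the copying occurs. Production systems retrieve from live web or database sources and send the assembled prompt to an LLM; the generate step is omitted here.

```python
# Illustrative sketch only: a minimal retrieval-augmented generation
# (RAG) pipeline. The corpus and the scoring function are hypothetical
# stand-ins; real systems retrieve from online sources at query time.
corpus = [
    "2025 election results: candidate A wins the runoff.",
    "Recipe for sourdough bread with a long cold ferment.",
    "Transcript of the 2025 championship final, reported live.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    # The retrieved passages are copied verbatim into the prompt, the
    # step the Article identifies as unauthorized copying of expression.
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# A real system would pass this prompt to an LLM to generate output.
print(build_prompt("Who won the 2025 election runoff?"))
```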
For these and other reasons, each of the four fair use factors of Section 107 of the Copyright Act weighs against AI’s claim of lawful use, especially when considered against the backdrop of a rapidly evolving market for licensed use of training materials.