
Copyright and Generative AI: Best practices for LLM training and recent developments in U.S. litigation
Abstract
Generative AI based on large language models (LLMs) such as ChatGPT, DALL·E-2, Midjourney, Stable Diffusion, JukeBox, and MusicLM can produce text, images, and music that are indistinguishable from human-authored works. The training data for these large language models consists predominantly of copyrighted works. This presentation and the accompanying article explore how generative AI fits within U.S. fair use rulings established in relation to previous generations of copy-reliant technology, including software reverse engineering, automated plagiarism detection systems, and the text data mining at the heart of the landmark HathiTrust and Google Books cases.
Although there is no machine learning exception to the principle of non-expressive use, the sheer size of large language models suggests that they are capable of memorizing and reconstituting works in the training data, something that is incompatible with non-expressive use. At the moment, memorization is an edge case. For the most part, the link between the training data and the output of generative AI is attenuated by a process of decomposition, abstraction, and remix. Generally, pseudo-expression generated by large language models does not infringe copyright because these models “learn” latent features and associations within the training data; they do not memorize snippets of original expression from individual works.
However, there are particular situations in the context of text-to-image models where memorization of the training data is more likely. The computer science literature suggests that memorization is more likely when: models are trained on many duplicates of the same work; images are associated with unique text descriptions; and the ratio of the size of the model to the size of the training data is relatively large. Professor Sag will talk through examples where these problems are accentuated and outline his proposals for initial best practices for “Copyright Safety for Generative AI” to reduce the risk of copyright and related infringement.
About the Speaker
Matthew Sag is a Professor of Law in Artificial Intelligence, Machine Learning and Data Science at Emory University Law School. Professor Sag is an expert in copyright law and intellectual property. He is a leading U.S. authority on the fair use doctrine in copyright law and its implications for researchers in the fields of text data mining, machine learning, and AI.
This event is proudly co-hosted by the University of Sydney Law School and the ARC Centre of Excellence for Automated Decision-Making and Society (ADM+S).
———————————
Time: 1.00-2.00pm (light lunch served at 12.45pm)
Date: Thursday, 12 October 2023
Venue: In-person: Common Room, Level 4, New Law Building (F10), University of Sydney, Camperdown, Gadigal Land, NSW 2006 (please follow directional signage on arrival)
———————————
Enquiries may be directed to: law.events@sydney.edu.au