Copyright and Generative AI: Best practices for LLM training and recent developments in U.S. litigation

Abstract

Generative AI based on large language models (LLMs) such as ChatGPT, DALL·E-2, Midjourney, Stable Diffusion, JukeBox, and MusicLM can produce text, images, and music that are indistinguishable from human-authored works. The training data for these large language models consists predominantly of copyrighted works. This presentation and the accompanying article explore how generative AI fits within U.S. fair use rulings established in relation to previous generations of copy-reliant technology, including software reverse engineering, automated plagiarism detection systems, and the text data mining at the heart of the landmark HathiTrust and Google Books cases.

Although there is no machine learning exception to the principle of non-expressive use, the largeness of likelihood models suggests that they are capable of memorizing and reconstituting works in the training data, something that is incompatible with non-expressive use. At the moment, memorization is an edge case. For the most part, the link between the training data and the output of generative AI is attenuated by a process of decomposition, abstraction, and remix. Generally, pseudo-expression generated by large language models does not infringe copyright because these models “learn” latent features and associations within the training data; they do not memorize snippets of original expression from individual works.

However, there are particular situations in the context of text-to-image models where memorization of the training data is more likely. The computer science literature suggests that memorization is more likely when: models are trained on many duplicates of the same work; images are associated with unique text descriptions; and the ratio of the size of the model to the training data is relatively large. Professor Sag will talk through examples where these problems are accentuated and outline his proposals for initial best practices for “Copyright Safety for Generative AI” to reduce the risk of copyright and related infringement.

About the Speaker

Matthew Sag is a Professor of Law in Artificial Intelligence, Machine Learning and Data Science at Emory University Law School. Professor Sag is an expert in copyright law and intellectual property. He is a leading U.S. authority on the fair use doctrine in copyright law and its implications for researchers in the fields of text data mining, machine learning, and AI.

He was born and educated in Australia, earned honors in Law at the Australian National University in Canberra, and clerked for Justice Paul Finn at the Australian Federal Court. Sag practiced law in London as an associate at Arnold & Porter, and in Silicon Valley with Skadden, Arps, Slate, Meagher & Flom. Prior to Emory, he taught at DePaul University and Loyola Chicago; he has also held visiting posts at Northwestern University, the University of Virginia and the University of Melbourne.

Sag is currently working on several theoretical contributions to copyright law in relation to AI and machine learning and a series of empirical papers using text-mining and machine learning tools to study judicial behavior. His work has been published in leading journals such as Nature, and the law reviews of the University of California Berkeley, Georgetown, Northwestern, Notre Dame, Vanderbilt, Iowa and William & Mary, among others. His research has been widely cited in academic works, court submissions, judicial opinions and government reports.

About the Moderator

Daniela Simone is an intellectual property law scholar with a special interest in copyright law and the challenges of the digital age. Daniela holds DPhil, MPhil and BCL degrees from the University of Oxford and a BA (English and French)/LLB (Hons I) degree from the University of Sydney. Daniela is a qualified lawyer and has worked at the global commercial law firm Ashurst.

Prior to joining Macquarie Law School, Daniela was Lecturer in Law and Co-Director of the Institute of Brand and Innovation Law at University College London. Daniela was founder of the University of Oxford’s Intellectual Property Discussion Group (and its convenor until 2013). She is a Fellow of the Higher Education Academy with extensive experience in course design and innovative, research-led teaching.

Daniela’s research explores the intersection of law, technology, and culture. She is interested in collaborative authorship, artificial intelligence, the disruption new technology has brought to copyright law, regulation of the internet, the interaction between law and social norms, the international IP system, philosophy of IP, and the regulation of cultural property. Her work embraces comparative and inter-disciplinary methods and she is keen to engage directly with stakeholders.

 

———————————

Time: 1.00-2.00pm (arrivals are welcome from 12.30pm to mingle and settle in with lunch)

Date: Thursday, 12 October 2023

Venue: In-person: Law Foyer, Level 2, New Law Building (F10), University of Sydney, Camperdown, Gadigal Land, NSW 2006 (please follow directional signage on arrival)

———————————

This event is proudly co-hosted by the University of Sydney Law School and the ARC Centre of Excellence for Automated Decision-Making and Society (ADM+S). Our moderator joins us from Macquarie Law School. 

Enquiries may be directed to: law.events@sydney.edu.au