Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit
Keywords: memorization, copyright infringement, large language models
TL;DR: Despite the public attention on OpenAI's models, other frontier LLMs exhibit significantly more memorization than OpenAI's.
Abstract: Copyright infringement in frontier large language models (LLMs) has received much attention recently due to the case NYT v. Microsoft, filed in December 2023. The New York Times claims that GPT-4 infringed its copyrights both by reproducing its articles for use in LLM training and by memorizing those inputs and thereby publicly displaying them in LLM outputs. This research measures the propensity of OpenAI's LLMs to exhibit verbatim memorization in their outputs relative to other LLMs, focusing specifically on news articles. LLMs operate on statistical patterns, indirectly "storing" information by learning the statistical distribution of text over a training corpus. We show that OpenAI models are currently less prone to the elicitation of memorization than models from Meta, Anthropic, or Mistral. We also find that the bigger the model, the more memorization we can elicit, particularly for models with more than 100 billion parameters. Our findings have practical implications for training: more attention must be paid to preventing verbatim memorization in bigger models. Our findings also have legal significance: in assessing the relative memorization capacity of OpenAI's LLMs, we probe the strength of The New York Times' copyright infringement claims and OpenAI's legal defenses, while underscoring issues at the intersection of generative artificial intelligence (AI) and law and policy more broadly.
Submission Number: 84