Preventing Memorized Completions through White-Box Filtering

Published: 05 Mar 2024 · Last Modified: 08 May 2024 · ICLR 2024 R2-FM Workshop Poster · CC BY 4.0
Keywords: memorization, model-internals, probing, copyright, privacy
TL;DR: Model internals are more predictive of memorization than the text alone. Probes beat text classifiers in generalization and sample efficiency.
Abstract: Large Language Models (LLMs) can generate text they have memorized during training, which raises privacy and copyright concerns. For example, in a recent lawsuit brought by the New York Times against OpenAI, it was argued that GPT-4's verbatim memorization of NYT articles violated copyright law \citep{nytlawsuit2023}. Current production systems moderate content with a combination of small text classifiers and string-processing algorithms, which can fail to generalize. Recent work suggests that a model's internal activations contain rich descriptions of its computations. In this work, we show that probes can detect LLM regurgitation of memorized training data and outperform text classifiers across a wide array of generalization settings. Probes are also more sample- and parameter-efficient. Finally, we build a filtering mechanism based on rejection sampling that effectively mitigates memorized completions.
Submission Number: 43
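
A minimal sketch of the two pieces the abstract describes: a probe trained on a model's internal activations to score P(memorized), and a rejection-sampling filter that resamples completions the probe flags. This is not the authors' implementation; the linear probe, the pooling of activations, the threshold, and the `generate_with_activations` helper are all illustrative assumptions.

```python
# Sketch only: logistic-regression probe over pooled hidden activations,
# plus a rejection-sampling filter that resamples flagged completions.
# `generate_with_activations` is a hypothetical callable returning a
# completion string and the pooled activation vector that produced it.
import torch
import torch.nn as nn


class MemorizationProbe(nn.Module):
    """Linear probe mapping a pooled activation vector to P(memorized)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(acts)).squeeze(-1)


def train_probe(acts, labels, epochs=100, lr=1e-2):
    """Fit the probe on (activation, is_memorized) pairs with BCE loss."""
    probe = MemorizationProbe(acts.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts), labels)
        loss.backward()
        opt.step()
    return probe


def filtered_generate(prompt, generate_with_activations, probe,
                      threshold=0.5, max_tries=8):
    """Rejection sampling: resample until the probe score drops below threshold."""
    for _ in range(max_tries):
        completion, acts = generate_with_activations(prompt)
        if probe(acts).item() < threshold:
            return completion
    return None  # give up (or fall back to a refusal) after max_tries


if __name__ == "__main__":
    # Synthetic data standing in for pooled LLM activations and labels.
    hidden_dim = 64
    acts = torch.randn(256, hidden_dim)
    labels = (acts[:, 0] > 0).float()  # toy "memorized" labels
    probe = train_probe(acts, labels)

    def fake_generate(prompt):
        return "some completion", torch.randn(hidden_dim)

    print(filtered_generate("Once upon a time", fake_generate, probe))
```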