Preventing Memorized Completions through White-Box Filtering

ICLR 2024 Workshop SeT LLM, Submission 30

Published: 04 Mar 2024, Last Modified: 19 Apr 2024 · SeT LLM @ ICLR 2024 · CC BY 4.0
Keywords: memorization, model-internals, probing, copyright, privacy
TL;DR: Model internals are more predictive of memorization than the text alone. Probes beat text classifiers in generalization and sample efficiency.
Abstract: Large language models (LLMs) can generate text memorized during training, which raises privacy and copyright concerns. For example, in a recent lawsuit filed by the New York Times against OpenAI, it was argued that GPT-4's verbatim reproduction of NYT articles violated copyright law \citep{nytlawsuit2023}. Current production systems moderate content with small text classifiers or string-matching algorithms, both of which can fail to generalize. In this work, we show that a model's internal computations provide an effective signal for memorization. Probes trained to detect LLM regurgitation of memorized training data are more sample-efficient and parameter-efficient than text classifiers, and they generalize better. We package this signal into a rejection-sampling-based filtering mechanism that effectively mitigates memorized completions.
Submission Number: 30
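
The abstract describes two components: a probe over model internals that scores how likely a completion is to be regurgitated training data, and a rejection-sampling filter that resamples until the probe accepts. The sketch below is a minimal illustration of that pipeline, not the authors' implementation: the model (`gpt2`), probe layer, threshold, sampling budget, and the separately trained probe weights (`probe.pt`) are all hypothetical placeholders.

```python
# Minimal sketch: a linear probe on hidden states inside a rejection-sampling
# loop. All hyperparameters below are illustrative assumptions, not values
# from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in model; the paper targets larger LLMs
PROBE_LAYER = 6       # hypothetical layer whose activations feed the probe
THRESHOLD = 0.5       # hypothetical decision threshold on the probe score
MAX_TRIES = 8         # rejection-sampling budget

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Linear probe: maps a hidden-state vector to P(text is memorized).
probe = torch.nn.Linear(model.config.hidden_size, 1)
# probe.load_state_dict(torch.load("probe.pt"))  # weights trained separately

@torch.no_grad()
def memorization_score(text: str) -> float:
    """Mean probe score over token positions at the chosen layer."""
    ids = tok(text, return_tensors="pt")
    out = model(**ids)
    hidden = out.hidden_states[PROBE_LAYER]            # (1, seq_len, hidden_size)
    scores = torch.sigmoid(probe(hidden)).squeeze(-1)  # (1, seq_len)
    return scores.mean().item()

@torch.no_grad()
def filtered_generate(prompt: str) -> str | None:
    """Rejection sampling: resample until the probe flags no memorization."""
    ids = tok(prompt, return_tensors="pt")
    for _ in range(MAX_TRIES):
        gen = model.generate(**ids, do_sample=True, max_new_tokens=64,
                             pad_token_id=tok.eos_token_id)
        completion = tok.decode(gen[0, ids["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        if memorization_score(prompt + completion) < THRESHOLD:
            return completion
    return None  # no acceptable sample within the budget
```

In this toy setup, a caller would invoke `filtered_generate(prompt)` and fall back to refusing or truncating the response when it returns `None`; the key design choice mirrored from the abstract is that the filter consumes internal activations rather than only the generated text.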