Keywords: LLM, memorization, input perturbation
TL;DR: This paper presents a novel approach to detecting LLM memorization by analyzing how input perturbations affect model performance.
Abstract: While LLMs achieve remarkable performance through training on massive datasets, they often exhibit concerning behaviors such as verbatim reproduction of training data rather than true generalization. This memorization phenomenon raises critical concerns regarding data privacy, intellectual property rights, and the reliability of model evaluations. This paper introduces PEARL, a novel approach for detecting memorization in LLMs. PEARL assesses the sensitivity of an LLM’s performance to input perturbations, enabling detection of memorization in a black-box setting without requiring access to the model’s internal parameters or architecture. Furthermore, we investigate how input perturbations affect output consistency, allowing us to distinguish between true generalization and memorization. Extensive experiments on Pythia models demonstrate that our framework robustly identifies cases where models regurgitate learned information. Applied to GPT-4o, PEARL not only detected memorization of classic texts (e.g., the Bible) and common code snippets from HumanEval but also provided evidence suggesting that certain data sources, such as New York Times articles, were likely included in the model’s training set.
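Illustrative sketch (not from the paper): one plausible way to operationalize a black-box perturbation-sensitivity test like the one the abstract describes is to lightly corrupt a prompt, re-query the model, and measure how far the completion drifts from the clean-prompt completion. The perturbation scheme, the `similarity` proxy, the `generate` callable, and the decision threshold below are all assumptions for illustration, not the authors' actual procedure.

```python
import difflib
import random


def perturb(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Apply light character-level noise to the prompt (hypothetical perturbation scheme)."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


def similarity(a: str, b: str) -> float:
    """Crude output-similarity proxy in [0, 1]; a real system would use a task-specific metric."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def sensitivity_score(generate, prompt: str, n_perturbations: int = 8, rate: float = 0.05) -> float:
    """`generate` is any black-box callable mapping prompt -> completion.

    Returns the average drop in output similarity under perturbed prompts,
    relative to the completion produced for the clean prompt.
    """
    reference = generate(prompt)
    drops = []
    for seed in range(n_perturbations):
        completion = generate(perturb(prompt, rate=rate, seed=seed))
        drops.append(1.0 - similarity(reference, completion))
    return sum(drops) / len(drops)


def likely_memorized(generate, prompt: str, threshold: float = 0.5) -> bool:
    """Flag prompts whose completions collapse under tiny input noise (threshold is an assumption)."""
    return sensitivity_score(generate, prompt) > threshold
```

The intuition this sketch encodes, which is only one reading of the abstract, is that a completion reproduced from memorized training data tends to break abruptly when its prefix is slightly perturbed, whereas genuinely generalized behavior degrades more gracefully.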
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21610