Keywords: LLM, memorization, extraction attack, membership inference, prompt optimization
Abstract: Training large language models (LLMs) on diverse datasets, including news, books, and user data, enhances their capabilities but also raises significant privacy and copyright concerns due to their capacity to memorize training data. Current memorization measurements, primarily based on extraction attacks such as Discoverable Memorization, focus on an LLM’s ability to reproduce training data verbatim when prompted. While various extensions to these methods exist, allowing for different prompt forms and approximate matching, they introduce numerous parameters whose arbitrary selection significantly impacts reported memorization rates. This paper addresses the critical research question of how to compute the false positive rate (FPR) of these diverse memorization measurements. We propose a practical definition of the FPR and ways to interpret it, offering a more principled approach to selecting an extraction attack and its parameters. Our findings reveal that while "stronger" extraction attacks often identify more memorized samples, they also tend to have higher FPRs. Notably, some computationally intensive methods exhibit lower extraction rates than simpler baselines when evaluated at a fixed FPR.
Submission Number: 40
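To make the core measurement concrete, here is a minimal Python sketch of a discoverable-memorization check and a held-out FPR estimate. The `generate` callable (a stand-in for a model's greedy decoding API), the `prefix_len` default, and the held-out-based FPR estimator are illustrative assumptions, not the paper's exact definitions.

```python
from typing import Callable, Sequence

TokenSeq = Sequence[int]

def discoverable_memorization(
    generate: Callable[[TokenSeq, int], TokenSeq],
    sample: TokenSeq,
    prefix_len: int = 50,
) -> bool:
    """Prompt with the first `prefix_len` tokens of `sample` and check
    whether the model reproduces the remaining tokens verbatim."""
    prefix, target = sample[:prefix_len], sample[prefix_len:]
    continuation = generate(prefix, len(target))
    return list(continuation) == list(target)

def estimate_fpr(
    generate: Callable[[TokenSeq, int], TokenSeq],
    heldout_samples: Sequence[TokenSeq],
    prefix_len: int = 50,
) -> float:
    """One plausible FPR estimate: the fraction of held-out (non-training)
    samples that the attack nevertheless flags as 'memorized'."""
    flags = [
        discoverable_memorization(generate, s, prefix_len)
        for s in heldout_samples
    ]
    return sum(flags) / max(len(flags), 1)

# Toy usage: a "model" that has memorized exactly one sequence.
memorized = list(range(100))

def toy_generate(prefix: TokenSeq, n: int) -> TokenSeq:
    if list(prefix) == memorized[:len(prefix)]:
        return memorized[len(prefix):len(prefix) + n]
    return [0] * n  # hypothetical fallback continuation

assert discoverable_memorization(toy_generate, memorized) is True
assert estimate_fpr(toy_generate, [[7] * 100, [3] * 100]) == 0.0
```

Under this framing, a "stronger" attack (e.g., optimized prompts or approximate matching with a loose similarity threshold) would replace the verbatim equality check, and the same held-out procedure would quantify how often it fires spuriously.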