Keywords: Membership inference attack, Pretraining Data Detection
Abstract: Although large language models (LMs) are widely deployed, the data used to train them is rarely disclosed. Given the scale of this data, up to trillions of tokens, it is all but certain to include potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LM, with no knowledge of its training data, can we determine whether the model was trained on that text? To study this problem, we introduce a dynamic benchmark, WIKIMIA, and a new detection method, MIN-K PROB. Our method is based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LM, while a seen example is less likely to contain words with such low probabilities. MIN-K PROB can be applied without any knowledge of the pretraining corpus and without any additional training, in contrast to previous detection methods, which require training a reference model on data similar to the pretraining data. Moreover, our experiments demonstrate that MIN-K PROB achieves a 7.4% improvement over these previous methods. Our analysis further shows that MIN-K PROB is an effective tool for detecting contaminated benchmark data and copyrighted content within LMs.
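The hypothesis stated in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule, so everything below is an assumption made for illustration: the function names `min_k_score` and `looks_seen`, the fraction `k`, the use of an average over the lowest token log-probabilities, and the decision `threshold` are all hypothetical. The sketch also assumes the per-token log-probabilities have already been obtained from the LM.

```python
import math


def min_k_score(token_log_probs, k=0.2):
    """Average the lowest k-fraction of per-token log-probabilities.

    Assumption: a text the model has seen should have few very unlikely
    tokens, so this average should be higher (closer to 0) for seen text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token, even for very short inputs.
    n = max(1, math.ceil(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n


def looks_seen(token_log_probs, k=0.2, threshold=-5.0):
    """Hypothetical decision rule: flag the text as seen if the score
    exceeds a threshold. The threshold value here is arbitrary."""
    return min_k_score(token_log_probs, k) > threshold
```

For example, a text whose log-probabilities are `[-0.5, -8.0, -1.0, -6.0]` contains two outlier tokens, so with `k=0.5` the score is the mean of `-8.0` and `-6.0`. In practice the threshold would need to be calibrated on held-out data rather than fixed in advance.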
Submission Number: 46