Keywords: Membership Inference, Dataset Inference, LLM Defenses
Abstract: There has been increasing legal interest in identifying possible copyright violations committed by Large Language Model (LLM) trainers. Many recently developed individual membership inference attacks have been shown to be weaker than previously thought, due to implicit distributional shifts. To combat this, progress has been made by considering large datasets and aggregating multiple different membership attacks. A hidden assumption in these methods is that if an LLM has improperly used a dataset, it was trained on that exact dataset. By challenging this assumption, we demonstrate, to our knowledge, the first failure of a large-scale dataset inference (DI) attack. In particular, we study LLM fine-tuning on both short- and long-text datasets. We adaptively transform the datasets before fine-tuning, improving model performance while evading dataset inference. In the case of long texts, we find that text summarization followed by rephrasing substantially reduces the success probability of DI in our setting, from over 95\% to less than 5\%. We also develop a new theoretical formulation of dataset inference tailored specifically to LLMs, which explains the effectiveness of our method and sheds light on how parameters, such as the number of training epochs, affect dataset inference.
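The summarize-then-rephrase transform described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the Hugging Face models (facebook/bart-large-cnn as summarizer, google/flan-t5-base prompted as a paraphraser), the generation settings, and the `raw_dataset` placeholder are all illustrative assumptions.

```python
# Minimal sketch of an adaptive dataset transform before fine-tuning:
# summarize each long text, then rephrase the summary. Model choices and
# generation parameters are illustrative, not the paper's configuration.
from transformers import pipeline

# Stage 1: summarization compresses each long document.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Stage 2: a generic instruction-tuned text2text model prompted to
# paraphrase; any dedicated paraphrasing model could be substituted.
rephraser = pipeline("text2text-generation", model="google/flan-t5-base")

def transform(doc: str) -> str:
    """Summarize a long text, then rephrase the summary."""
    summary = summarizer(
        doc, max_length=128, min_length=32, do_sample=False
    )[0]["summary_text"]
    rephrased = rephraser(
        f"Paraphrase the following text: {summary}", max_new_tokens=128
    )[0]["generated_text"]
    return rephrased

# Placeholder corpus; the fine-tuning set would be built from these outputs.
raw_dataset = ["A long source document that would otherwise be fine-tuned on verbatim."]
transformed_dataset = [transform(d) for d in raw_dataset]
```

The key design point is that fine-tuning then only ever sees the transformed texts, so per-example membership signals against the original dataset are weakened while the semantic content useful for training is retained.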
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22546