Abstract: Large Language Models (LLMs) embed sensitive, human-generated data, prompting the need for unlearning methods. Although certified unlearning offers strong privacy guarantees, its restrictive assumptions make it unsuitable for LLMs, giving rise to various heuristic approaches that are typically assessed through empirical evaluations. These standard evaluations randomly select data for removal, apply unlearning techniques, and use membership inference attacks (MIAs) to compare unlearned models against models retrained without the removed data. However, to ensure robust privacy protection for every data point, it is essential to account for scenarios in which certain data subsets face elevated risk. Prior research suggests that outliers, particularly data tied to minority groups, often exhibit a higher propensity for memorization, which indicates they may be more difficult to unlearn. Building on these insights, we introduce a complementary, minority-aware evaluation framework to highlight blind spots in existing frameworks. We substantiate our findings with carefully designed experiments, using canaries containing personally identifiable information (PII) to represent these minority subsets, and demonstrate that they suffer at least 20\% higher privacy leakage across various unlearning methods, MIAs, datasets, and LLM scales. Our proposed minority-aware evaluation framework marks an essential step toward more equitable and comprehensive assessments of LLM unlearning efficacy.
Lay Summary: Large language models (LLMs) are trained on vast amounts of human-generated data, including sensitive information such as phone numbers, emails, and other personal details. When these models memorize such information, it can lead to privacy risks, especially for minority groups whose data may be more unique or less represented in the training set. Yet existing evaluations of methods for removing sensitive data from LLMs, known as LLM unlearning, usually test on randomly selected data and overlook these high-risk cases.
To tackle this problem, we created a new way to evaluate unlearning methods that pays special attention to data from minority groups. We designed experiments using “canaries” (small, carefully chosen test cases containing personal information such as phone numbers) and found that, across various unlearning techniques and models, data from minority groups consistently suffers about 20% more privacy leakage.
Our work highlights the importance of designing fairer and more robust privacy evaluations for LLMs. By sharing our code and framework, we hope to help researchers and developers build safer AI systems that offer stronger privacy protections for everyone, not just the majority.
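As a concrete illustration of the evaluation described above, the minimal Python sketch below compares loss-thresholding MIA leakage on randomly selected forget data versus PII canaries after unlearning. It is a simplified sketch, not the released pipeline: the model, the forget subsets, and the held-out non-member texts are hypothetical placeholders, and the full framework additionally compares against a model retrained without the removed data, which this sketch omits; see the linked repository for the actual implementation.

```python
# Minimal sketch of a minority-aware MIA evaluation for LLM unlearning.
# The model, forget subsets, and holdout set below are illustrative placeholders,
# not the paper's actual data or pipeline.

import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def sequence_losses(model, tokenizer, texts, device="cpu"):
    """Per-sequence negative log-likelihood under `model` (lower = more memorized)."""
    model.eval().to(device)
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        losses.append(model(**enc, labels=enc["input_ids"]).loss.item())
    return losses


def mia_auc(member_losses, nonmember_losses):
    """Loss-thresholding MIA: AUC for separating members (low loss) from non-members."""
    scores = [-l for l in member_losses + nonmember_losses]
    labels = [1] * len(member_losses) + [0] * len(nonmember_losses)
    return roc_auc_score(labels, scores)


def minority_aware_leakage(unlearned_model, tokenizer, forget_random, forget_canaries, holdout):
    """Compare MIA leakage on randomly chosen forget data vs. PII canaries (minority proxy)."""
    holdout_losses = sequence_losses(unlearned_model, tokenizer, holdout)
    return {
        "random_auc": mia_auc(sequence_losses(unlearned_model, tokenizer, forget_random), holdout_losses),
        "canary_auc": mia_auc(sequence_losses(unlearned_model, tokenizer, forget_canaries), holdout_losses),
    }


if __name__ == "__main__":
    # Illustration only: a base GPT-2 stands in for an unlearned model, and the
    # phone numbers below are synthetic canaries, not real PII.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    forget_random = ["The museum opens at nine on weekdays.", "She ordered tea instead of coffee."]
    forget_canaries = ["Call Jane Doe at 555-0142 to confirm.", "John Roe's number is 555-0199."]
    holdout = ["The train was delayed by twenty minutes.", "He painted the fence last summer."]
    print(minority_aware_leakage(model, tok, forget_random, forget_canaries, holdout))
```

An unlearning method that protects both subsets equally should push both AUC scores toward 0.5; a persistently higher canary AUC is the kind of gap the minority-aware evaluation is designed to surface.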
Link To Code: https://github.com/Graph-COM/Minority_Aware_LLM_Unlearning.git
Primary Area: Social Aspects->Privacy
Keywords: Machine Unlearning, Minority Groups, Large Language Models
Submission Number: 12768