White-Box Auditing of Large Language Model Unlearning

Published: 19 Sept 2025 (last modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Machine Unlearning, Large Language Model
Abstract: Large language models (LLMs) can memorize sensitive information, raising serious privacy concerns. Machine unlearning offers a potential remedy, but it remains unclear whether existing methods truly erase such information or merely hide it within the model. A key challenge is quantifying the persistence of sensitive data under a unified evaluation framework. To address this, we construct a synthetic dataset containing fake personal information and propose a white-box auditing framework to rigorously assess whether information claimed to be forgotten is genuinely removed. Using this framework, we evaluate five existing unlearning methods and find that a simple "inverse greedy" decoding strategy (selecting the least likely token at each step) can recover supposedly forgotten personal information. Our results reveal that current unlearning approaches often fail to fully eliminate sensitive information, highlighting the need for more reliable methods to ensure privacy in deployed LLMs.
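The "inverse greedy" decoding mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_model` is a hypothetical stand-in for an LLM's next-token logits, and in the actual audit the logits would come from the unlearned model being evaluated.

```python
# "Inverse greedy" decoding sketch: at each step, append the *least*
# likely next token instead of the most likely one (greedy's argmax
# becomes argmin). toy_model is a hypothetical stand-in that returns
# fake logits over a 4-token vocabulary, varying with prefix length.

def toy_model(prefix):
    # Illustrative only: cycles logit values as the prefix grows.
    n = len(prefix)
    return [float((i + n) % 4) for i in range(4)]

def inverse_greedy_decode(model, prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        logits = model(tokens)
        # Greedy decoding would take the argmax; inverse greedy
        # takes the argmin of the next-token logits.
        least_likely = min(range(len(logits)), key=lambda i: logits[i])
        tokens.append(least_likely)
    return tokens

seq = inverse_greedy_decode(toy_model, [0], 3)
```

The point of the audit is that if a model were genuinely scrubbed, no decoding rule over its output distribution, including this low-probability one, should reconstruct the forgotten personal information.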
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14911