Textual Unlearning Gives a False Sense of Unlearning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Our study highlights the vulnerabilities and privacy risks of machine unlearning in language models. We demonstrate that existing unlearning methods not only fail to completely remove the targeted texts but also expose more about them in deployment.
Abstract: Language Models (LMs) are prone to "memorizing" training data, including substantial sensitive user information. To mitigate privacy risks and safeguard the right to be forgotten, machine unlearning has emerged as a promising approach for enabling LMs to efficiently "forget" specific texts. However, despite these good intentions, is textual unlearning really as effective and reliable as expected? To address this concern, we first propose the Unlearning Likelihood Ratio Attack+ (U-LiRA+), a rigorous textual unlearning auditing method, and find that unlearned texts can still be detected with very high confidence after unlearning. We then conduct an in-depth investigation into the privacy risks of textual unlearning mechanisms in deployment and present the Textual Unlearning Leakage Attack (TULA), along with its variants in both black- and white-box scenarios. We show that textual unlearning mechanisms can instead reveal more about the unlearned texts, exposing them to significant membership inference and data reconstruction risks. Our findings highlight that existing textual unlearning gives a false sense of unlearning, underscoring the need for more robust and secure unlearning mechanisms.
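For intuition, below is a minimal, self-contained sketch of the generic likelihood-ratio membership test that this style of auditing builds on. It is not the paper's U-LiRA+ procedure; the function names and the synthetic shadow-model losses are hypothetical placeholders used only to illustrate the idea.

```python
# Illustrative likelihood-ratio membership test (LiRA-style), NOT U-LiRA+.
# Assumption: we have per-example losses of the target text under shadow models
# trained WITH it ("in") and WITHOUT it ("out"), plus the loss observed under
# the audited (unlearned) model. All names here are hypothetical.
import numpy as np
from scipy.stats import norm

def likelihood_ratio_score(loss_in, loss_out, observed_loss):
    """Log-likelihood ratio that the observed loss came from the 'in' world."""
    mu_in, sigma_in = np.mean(loss_in), np.std(loss_in) + 1e-8
    mu_out, sigma_out = np.mean(loss_out), np.std(loss_out) + 1e-8
    log_p_in = norm.logpdf(observed_loss, mu_in, sigma_in)
    log_p_out = norm.logpdf(observed_loss, mu_out, sigma_out)
    return log_p_in - log_p_out  # > 0 suggests the text still leaves a trace

# Toy usage with synthetic shadow-model losses.
rng = np.random.default_rng(0)
loss_in = rng.normal(2.0, 0.3, size=64)   # losses when the text was trained on
loss_out = rng.normal(3.1, 0.4, size=64)  # losses when it never was
print(likelihood_ratio_score(loss_in, loss_out, observed_loss=2.2))
```

A positive score means the observed loss is better explained by the "text was trained on" hypothesis, i.e., unlearning left a detectable trace.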
Lay Summary: Machine unlearning is a method designed to make artificial intelligence (AI) models "forget" specific pieces of information. This is especially important for protecting sensitive data and complying with privacy laws. However, our research shows that unlearning doesn't work as well as it seems in language models (the AI behind tools like chatbots). We found that current unlearning techniques often fail to fully erase the targeted information. Using a rigorous auditing approach, we were still able to detect traces of the supposedly forgotten data. Even more concerning, we discovered that trying to unlearn data can backfire: it can actually make it easier for attackers to figure out what you want to forget by comparing models before and after unlearning. Our work highlights serious flaws in current machine unlearning practices and emphasizes the need for safer, more reliable methods to truly protect user privacy in AI systems.
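The "comparing models before and after unlearning" intuition from the lay summary can be sketched as a simple loss-shift ranking. This is only an illustrative toy under the assumption of black-box loss access to both model versions; it is not the paper's TULA attack, and all inputs are synthetic.

```python
# Toy sketch of a before/after-unlearning comparison as a leakage signal
# (a generic loss-shift ranking, not the paper's TULA attack).
import numpy as np

def rank_by_loss_shift(losses_before, losses_after):
    """Return candidate indices sorted by how much their loss increased."""
    shift = np.asarray(losses_after) - np.asarray(losses_before)
    return np.argsort(-shift)  # largest increase first

# Toy usage: candidate 2 plays the role of the unlearned text, whose loss
# jumps after the unlearning update while the others stay put.
before = [2.1, 2.4, 1.8, 2.6]
after  = [2.1, 2.5, 3.3, 2.6]
print(rank_by_loss_shift(before, after))  # candidate 2 ranks first
```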
Primary Area: Social Aspects->Privacy
Keywords: machine unlearning; unlearning auditing; membership inference attacks; data reconstruction attacks
Submission Number: 8455