Not All Wrong is Bad: Using Adversarial Examples for Unlearning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
Abstract: Machine unlearning, where users can request the deletion of a forget dataset, is becoming increasingly important because of numerous privacy regulations. Initial works on "exact" unlearning (e.g., retraining) incur large computational overheads. However, while computationally inexpensive, "approximate" methods have fallen short of reaching the effectiveness of exact unlearning: the models they produce fail to obtain comparable accuracy and prediction confidence on both the forget and test (i.e., unseen) datasets. Exploiting this observation, we propose a new unlearning method, Adversarial Machine UNlearning (AMUN), that outperforms prior state-of-the-art (SOTA) methods for image classification. AMUN lowers the confidence of the model on the forget samples by fine-tuning the model on their corresponding adversarial examples. Adversarial examples naturally belong to the distribution imposed by the model on the input space; fine-tuning the model on the adversarial examples closest to the corresponding forget samples (a) localizes the changes to the decision boundary of the model around each forget sample and (b) avoids drastic changes to the global behavior of the model, thereby preserving the model's accuracy on test samples. Using AMUN for unlearning a random 10% of CIFAR-10 samples, we observe that even SOTA membership inference attacks cannot do better than random guessing.
Lay Summary: The paper presents Adversarial Machine Unlearning (AMUN), a computationally efficient framework that enables trained models to expunge designated training instances without the expense of full retraining. For each data point slated for removal, AMUN finds a proximate adversarial example, i.e., a slightly perturbed input that the model misclassifies, and conducts a brief fine-tuning phase on these modified samples with their wrong labels. This procedure perturbs the model's decision boundary only in the local vicinity of the targeted points, markedly reducing its confidence on them while preserving overall performance on the remaining data. Empirical evaluations demonstrate that, even under rigorous privacy attacks designed to detect traces of the forgotten data, the resulting model behaves comparably to one trained ab initio without those records, surpassing prior approximate-unlearning approaches. Another significant advantage of AMUN is that it remains effective even when there is no access to the remaining samples.
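The summary above describes the core procedure: for each forget sample, find a nearby adversarial example and fine-tune the model on it with its (wrong) adversarial label. Below is a minimal PyTorch sketch of that idea; it is not the authors' implementation (see the linked repository for that), and the helper names (`nearest_adversarial`, `amun_unlearn`), the PGD-style search, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of the AMUN idea (not the official implementation;
# see the linked repository for the authors' code).
import torch
import torch.nn.functional as F

def nearest_adversarial(model, x, y, eps=8/255, alpha=1/255, steps=50):
    """PGD-style search for adversarial examples close to the forget samples x.
    Returns the perturbed inputs and the (wrong) labels the model assigns to them."""
    model.eval()
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # stay within a small ball around x
        x_adv = x_adv.clamp(0, 1).detach()
        with torch.no_grad():
            if (model(x_adv).argmax(dim=1) != y).all():  # stop once all are misclassified
                break
    with torch.no_grad():
        # Samples that remain correctly classified keep their original labels in this sketch.
        adv_labels = model(x_adv).argmax(dim=1)
    return x_adv, adv_labels

def amun_unlearn(model, forget_loader, epochs=5, lr=1e-4):
    """Fine-tune the model on adversarial examples of the forget samples,
    using the adversarial (wrong) labels as targets."""
    # Precompute the adversarial examples once; the paper's exact schedule may differ.
    adv_batches = [nearest_adversarial(model, x, y) for x, y in forget_loader]
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x_adv, y_adv in adv_batches:
            opt.zero_grad()
            F.cross_entropy(model(x_adv), y_adv).backward()
            opt.step()
    return model
```

In this sketch, one would call `amun_unlearn(model, forget_loader)` with the trained model and the forget set only; consistent with the summary above, no access to the retained training data is assumed.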
Link To Code: https://github.com/Ali-E/AMUN
Primary Area: Social Aspects->Safety
Keywords: Machine Unlearning, Adversarial example, Fine-tuning
Submission Number: 13000