Keywords: LLM unlearning; LLM watermarking
TL;DR: Noisy But Forgotten: LLM Unlearning Is Robust against Perturbed Data in the Wild
Abstract: Large language models (LLMs) demonstrate impressive generative capabilities but pose ethical and security risks by memorizing sensitive data, amplifying biases, and generating harmful content. These concerns motivate the study of LLM unlearning—the task of removing undesirable data-induced knowledge from pre-trained models. While existing methods often assume access to clean, well-defined forget datasets, real-world forget data is often low-quality, synthetically rewritten, or watermarked—raising concerns about the reliability of unlearning. This work presents the first systematic investigation into the impact of perturbed or low-fidelity forget data on unlearning performance. Through extensive experiments on the WMDP and MUSE benchmarks using state-of-the-art RMU and NPO unlearning algorithms, along with saliency-based analyses, we find that unlearning remains surprisingly robust to data perturbations, with core semantic elements often preserved. These findings underscore both the resilience of current unlearning algorithms and the critical importance of adopting a data-centric perspective when evaluating unlearning efficacy.
Submission Number: 44