Keywords: LLM unlearning; LLM watermarking
TL;DR: Noisy But Forgotten: LLM Unlearning Is Robust against Perturbed Data in the Wild
Abstract: Large language models (LLMs) demonstrate impressive generative capabilities but pose ethical and security risks by memorizing sensitive data, amplifying biases, and generating harmful content. These concerns motivate the study of LLM unlearning—the task of removing undesirable data-induced knowledge from pre-trained models. While existing methods often assume access to clean, well-defined forget datasets, real-world forget data is often low-quality, synthetically rewritten, or watermarked—raising concerns about the reliability of unlearning. This work presents the first systematic investigation into the impact of perturbed or low-fidelity forget data on unlearning performance. Through extensive experiments on the WMDP and MUSE benchmarks using state-of-the-art RMU and NPO unlearning algorithms, along with saliency-based analyses, we find that unlearning remains surprisingly robust to data perturbations, with core semantic elements often preserved. These findings underscore both the resilience of current unlearning algorithms and the critical importance of adopting a data-centric perspective when evaluating unlearning efficacy.
Submission Number: 44