Abstract: Machine unlearning presents a promising approach to mitigating privacy and safety concerns in large language models (LLMs) by enabling the selective removal of targeted data or knowledge while preserving model utility. However, existing unlearning methods remain overly sensitive to downstream fine-tuning, which can rapidly recover supposedly unlearned information even when the fine-tuning task is entirely unrelated to the unlearning objective.
To enhance robustness, we introduce the concept of 'invariance' into unlearning for the first time from the perspective of invariant risk minimization (IRM), a principle for environment-agnostic training. By leveraging IRM, we develop a new invariance-regularized LLM unlearning framework, termed invariant LLM unlearning (ILU).
We show that the proposed invariance regularization, even when using only a single fine-tuning dataset during ILU training, enables unlearning robustness that generalizes effectively to diverse, unseen fine-tuning tasks at test time.
A task vector analysis is also provided to further elucidate the rationale behind ILU's effectiveness. Extensive experiments on the WMDP benchmark, which focuses on removing an LLM's ability to generate hazardous knowledge, show that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving fine-tuning performance.
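For readers unfamiliar with the IRM penalty that ILU builds on, the sketch below illustrates a standard IRMv1-style gradient penalty computed over fine-tuning "environments" and added to an unlearning objective. This is an illustrative assumption rather than the released ILU implementation (see the linked repository); the function names, the classification-style loss, and the weight `lam` are placeholders.

```python
# Minimal sketch (assumption, not the authors' code): an IRMv1-style invariance
# penalty added to an unlearning objective, computed over fine-tuning environments.
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    # IRMv1: scale the logits by a dummy multiplier w = 1.0 and penalize the
    # squared gradient of the environment risk with respect to w.
    w = torch.ones(1, requires_grad=True)
    risk = F.cross_entropy(logits * w, labels)
    (grad,) = torch.autograd.grad(risk, w, create_graph=True)
    return grad.pow(2).sum()

def invariance_regularized_loss(base_loss, env_batches, model, lam=1.0):
    # base_loss: the unlearning objective (e.g., an NPO- or RMU-style forget/retain loss).
    # env_batches: (inputs, labels) batches from one or more fine-tuning environments;
    # per the abstract, a single fine-tuning dataset already suffices in ILU.
    penalty = sum(irm_penalty(model(x), y) for x, y in env_batches)
    return base_loss + lam * penalty / max(len(env_batches), 1)
```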
Lay Summary: (1) Large language models (LLMs) can memorize sensitive or unsafe information, and safely removing this knowledge—known as machine unlearning—is crucial but difficult, especially when downstream training interferes with forgetting.
(2) We introduce a new method called invariant LLM unlearning (ILU), which uses a regularization strategy inspired by invariant risk minimization to make forgetting more reliable and robust, even across unrelated tasks.
(3) This approach helps build safer AI systems by ensuring sensitive knowledge is thoroughly removed while preserving the model’s performance on useful tasks.
Link To Code: https://github.com/OPTML-Group/Unlearn-ILU
Primary Area: Deep Learning->Large Language Models
Keywords: Unlearning; LLM; Robustness; Fine-tuning attack;
Submission Number: 8162