Improving LLM Unlearning Robustness via Random Perturbations

TMLR Paper7289 Authors

02 Feb 2026 (modified: 03 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token appears in a retain-query. To understand the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formalize how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with the target representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, disrupt the unlearned model's behavior, much like a successful backdoor attack. In this sense, LLM unlearning methods themselves poison the model, making it more vulnerable to forget-tokens, and hide rather than erase the target knowledge. To mitigate the vulnerability introduced by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model- and method-agnostic approach with theoretical guarantees for improving the robustness of unlearned models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performance. This backdoor attack-defense framework offers insights into the mechanism of unlearning that can shed light on future research directions for improving unlearning robustness.
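To make the idea concrete, below is a minimal sketch of what a noise-augmented retain step could look like. The abstract only states that RNA applies random perturbations during the retaining process; the injection point (input embeddings), the Gaussian noise form, and the scale parameter `sigma` are assumptions for illustration, not details confirmed by the paper.

```python
import torch

def rna_retain_step(model, input_ids, labels, sigma=0.01):
    """Hypothetical sketch of one retain-step with Random Noise
    Augmentation (RNA) on a HuggingFace-style causal LM.

    Assumptions (not from the abstract): noise is Gaussian with scale
    `sigma` and is added to the input token embeddings before the
    forward pass, so the retain loss is computed on perturbed inputs.
    """
    # Look up token embeddings for the retain-query.
    embeds = model.get_input_embeddings()(input_ids)
    # Perturb the embeddings so the model cannot key its behavior to
    # exact forget-token embeddings (the hypothesized backdoor trigger)
    # that may appear inside retain-queries.
    noisy_embeds = embeds + sigma * torch.randn_like(embeds)
    # Standard language-modeling loss on the perturbed inputs.
    outputs = model(inputs_embeds=noisy_embeds, labels=labels)
    return outputs.loss
```

Under this reading, the retain loss computed on noise-perturbed inputs plays the role of a backdoor defense: it discourages the alignment between forget-tokens and the target representations that the forgetting process would otherwise induce.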
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In this revision, we have made the following main updates: (1) Clarified the implication of Assumption 1 and provided empirical support for Assumption 1 (response to Reviewer 97nx). (2) Clarified the meaning of our proposed framework (response to Reviewer 97nx). (3) Conducted additional experiments on the forget-robustness of RNA models (response to Reviewer 97nx and Reviewer SdC2). (4) Conducted additional experiments on baselines (weight decay and dropout), comparing them with RNA (response to Reviewer SdC2).
Assigned Action Editor: ~Boyu_Wang3
Submission Number: 7289