AGT\textsuperscript{AO}: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality

ACL ARR 2026 January Submission 2148 Authors

02 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, License: CC BY 4.0
Keywords: Machine Unlearning, Adversarial Gating Training, Adaptive Orthogonality, Catastrophic Forgetting, Superficial Forgetting
Abstract: While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose \textbf{AGT\textsuperscript{AO}} (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces \textbf{Adaptive Orthogonality (AO)} to dynamically mitigate geometric gradient conflicts between the forgetting and retention objectives, thereby minimizing unintended knowledge degradation. Concurrently, \textbf{Adversarial Gating Training (AGT)} formulates unlearning as a latent-space min-max game, employing a curriculum-based gating mechanism to simulate and counter internal recovery attempts. Extensive experiments demonstrate that AGT\textsuperscript{AO} achieves a superior trade-off between unlearning efficacy (KUR $\approx$ 0.01) and model utility (MMLU 58.30).\footnote{Code is available at \url{https://anonymous.4open.science/r/AGT-unlearning}.}
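
To make the Adaptive Orthogonality idea concrete, below is a minimal PyTorch sketch, assuming AO resembles gradient-projection schemes such as PCGrad: when the forgetting and retention gradients conflict (negative inner product), the forgetting gradient is partially projected onto the orthogonal complement of the retention gradient, with a coefficient that adapts to the degree of conflict. The function name and the specific choice of adaptive coefficient are illustrative assumptions, not the paper's exact formulation.

    import torch

    def adaptive_orthogonal_grad(g_forget: torch.Tensor,
                                 g_retain: torch.Tensor,
                                 eps: float = 1e-12) -> torch.Tensor:
        # Hypothetical AO step: both inputs are flattened parameter
        # gradients of the forgetting and retention losses.
        dot = torch.dot(g_forget, g_retain)
        if dot >= 0:
            # No geometric conflict: keep the forgetting gradient as-is.
            return g_forget
        # Adaptive strength in [0, 1]: stronger projection for more
        # anti-aligned gradients (an assumed reading of "adaptive").
        cos = dot / (g_forget.norm() * g_retain.norm()).clamp_min(eps)
        alpha = (-cos).clamp(0.0, 1.0)
        # Remove a fraction of g_forget's component along g_retain,
        # steering the unlearning update away from retained knowledge.
        retain_sq = torch.dot(g_retain, g_retain).clamp_min(eps)
        return g_forget - alpha * (dot / retain_sq) * g_retain

In a full training loop, the projected forgetting gradient would be combined with the retention gradient before the optimizer step; the AGT component would additionally train a latent-space gating adversary in a min-max fashion, which this sketch omits.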
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling, Ethics, Bias, and Fairness
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2148