Lock on Target! Precision Unlearning via Directional Control

ACL ARR 2025 May Submission 6540 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Machine unlearning aims to remove harmful, sensitive, or outdated knowledge from a model without costly retraining. However, existing methods suffer from two critical limitations: (1) **collateral forgetting**, where erasing target data inadvertently removes related but desirable knowledge, and (2) **generality forgetting**, where aggressive unlearning degrades the model's general capabilities. To address these challenges, we propose **D**irecti**O**n **G**uide unl**E**arning (DOGE), a novel method that enables precise knowledge erasure by identifying and leveraging a targeted "unlearning direction" in the model's parameter space. DOGE first extracts this direction through a differential analysis of the representations of forget and retain samples, pinpointing the subspace associated with the unwanted knowledge. It then applies updates only along this direction, minimizing interference with retained information and with general model performance. Experiments across multiple benchmarks demonstrate that DOGE achieves state-of-the-art unlearning precision while preserving both related knowledge and general capabilities.
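To make the mechanism described in the abstract concrete, here is a minimal sketch of the core idea: estimate a direction as the difference between the mean representations of forget and retain samples, then keep only the component of an unlearning update that lies along that direction. This is an illustrative PyTorch toy in representation space, not the authors' implementation; the function names, tensor shapes, and random toy data are all assumptions.

```python
import torch

torch.manual_seed(0)

def unlearning_direction(h_forget: torch.Tensor, h_retain: torch.Tensor) -> torch.Tensor:
    # Difference of mean representations between forget and retain
    # samples, unit-normalized to serve as the "unlearning direction".
    d = h_forget.mean(dim=0) - h_retain.mean(dim=0)
    return d / d.norm()

def project_along(v: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    # Component of update v along unit direction u: (v . u) u.
    # Components orthogonal to u are discarded, limiting side effects
    # on retained knowledge.
    return (v @ u) * u

# Toy usage: 64-dim hidden states for 32 forget / 128 retain samples.
h_f = torch.randn(32, 64) + 0.5   # forget-set representations (shifted)
h_r = torch.randn(128, 64)        # retain-set representations
u = unlearning_direction(h_f, h_r)

# A raw unlearning update (e.g., a gradient-ascent step on the forget loss)
raw_update = torch.randn(64)
safe_update = project_along(raw_update, u)  # keep only the targeted component
```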
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study
Languages Studied: English
Submission Number: 6540