Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

15 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: unlearning, representation-engineering, language-models, biosecurity, cybersecurity, fine-tuning, robustness, adversarial-attacks, WMDP, AI-safety, selective-unlearning, neural-representations, evaluation-robustness
TL;DR: When we collapse general representations before computing unlearning updates, we prevent the disruption of general performance and make unlearning more robust.
Abstract: Current unlearning and safety training methods consistently fail to remove dangerous knowledge from language models. We identify the root cause: unlearning targets representations that are too general. We then develop a highly selective technique that unlearns robustly while preserving general performance. Our method performs PCA on activations and module-output gradients to identify subspaces containing common representations, then collapses these subspaces before computing unlearning updates, a technique we term Collapse of Irrelevant Representations (CIR). This avoids unlearning general knowledge and targets only representations specific to the facts being unlearned. When unlearning bio- and cyber-hazardous facts from Llama-3.1-8B, we achieve over 30× greater reduction in post-attack accuracy than the best baseline (Circuit Breakers), while disrupting general performance 30× less and using under 3 GPU-seconds per fact. Thus, by disentangling harmful and benign capabilities at the level of representations, CIR enables robust and non-disruptive unlearning.
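The core operation the abstract describes (identify a principal subspace of common representations, then collapse it out of the unlearning update) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the use of plain NumPy SVD in place of PCA over real model activations, and the choice of `k` are all assumptions made here for clarity.

```python
import numpy as np

def collapse_irrelevant(update, acts, k=10):
    """Hypothetical sketch of the CIR idea: remove the top-k principal
    directions of the activation matrix (the 'common' representations)
    from an unlearning update vector before it is applied.

    update: (d,) gradient/update vector for one module
    acts:   (n, d) activations collected on general (benign) text
    k:      number of principal directions to collapse (assumed value)
    """
    # Center activations and find principal directions via SVD
    # (rows of Vt are the principal axes, ordered by variance).
    X = acts - acts.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k]  # (k, d) top-k common-representation directions

    # Project the update onto the orthogonal complement of that
    # subspace, so shared/general directions are collapsed to zero
    # and only fact-specific directions remain in the update.
    return update - V.T @ (V @ update)
```

After this projection, the returned update has (numerically) zero component along the top-k principal directions of the general activations, which is the selectivity property the abstract attributes to CIR.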
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 6067