Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Published: 27 Oct 2025, Last Modified: 27 Oct 2025. NeurIPS Lock-LLM Workshop 2025 Poster. License: CC BY 4.0
Keywords: unlearning, representation-engineering, language-models, biosecurity, cybersecurity, fine-tuning, robustness, adversarial-attacks, WMDP, AI-safety, selective-unlearning, neural-representations, evaluation-robustness
TL;DR: When we collapse general representations before computing unlearning updates, we prevent the disruption of general performance and make unlearning more robust.
Abstract: Current unlearning techniques and safety training consistently fail to remove dangerous knowledge from language models. We analyze the root causes and propose a highly selective technique that unlearns robustly without disrupting general performance. We perform PCA on activations and output gradients to identify subspaces containing common representations, and collapse them before computing unlearning updates. This way we avoid unlearning general representations and target only those specific to the facts being unlearned. When unlearning facts from the WMDP dataset on Llama-3.1-8B, we reduce post-attack accuracy 30x more than the state of the art (Circuit Breakers) on biohazardous facts and 6x more on cyberhazardous facts. Despite this, we disrupt general performance 30x less, while requiring less than 3 GPU-seconds per fact.
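The abstract describes identifying common-representation subspaces via PCA and collapsing them before the unlearning update. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a batch of "general" activations (or output gradients) of shape (n, d), an illustrative subspace size k, and a simple gradient-ascent-style update; all names and hyperparameters are hypothetical.

```python
# Sketch only: collapse the top principal-component subspace of general
# activations/gradients before computing an unlearning update.
import torch

def common_subspace(general_batch: torch.Tensor, k: int = 16) -> torch.Tensor:
    """PCA via SVD on a batch (n, d); returns the top-k principal directions (d, k)."""
    centered = general_batch - general_batch.mean(dim=0, keepdim=True)
    # Rows of Vh are the principal directions of the centered batch.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:k].T  # (d, k), orthonormal columns

def collapse(x: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Project out the span of `basis` from x: x - (x @ basis) @ basis^T."""
    return x - (x @ basis) @ basis.T

def unlearning_step(forget_grad: torch.Tensor,
                    general_acts: torch.Tensor,
                    lr: float = 1e-4,
                    k: int = 16) -> torch.Tensor:
    """Return an update that targets only directions not shared with general data."""
    basis = common_subspace(general_acts, k)       # directions common to general inputs
    selective_grad = collapse(forget_grad, basis)  # keep only fact-specific directions
    return lr * selective_grad                     # update to apply to the layer weights
```

In this sketch the collapse is applied to the forget-set gradient; the paper applies the same projection idea to both activations and output gradients when forming per-layer updates.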
Submission Number: 14