Keywords: unlearning, representation-engineering, language-models, biosecurity, cybersecurity, fine-tuning, robustness, adversarial-attacks, WMDP, AI-safety, selective-unlearning, neural-representations, evaluation-robustness
TL;DR: Collapsing general representations before computing unlearning updates prevents disruption of general performance and makes unlearning more robust.
Abstract: Making large language models (LLMs) deliberately forget specific knowledge while preserving general capabilities remains a central challenge of machine unlearning. Despite progress, existing methods consistently fall short of this goal: unlearned knowledge can be easily recovered with brief fine-tuning or few-shot attacks. We identify an underlying cause: existing methods target representations shared across forget and retain data, making unlearning both disruptive to general capabilities and trivially reversible. We propose RepSelect (Representation Selectivity), which isolates representations specific to the forget set by collapsing the principal components of activations and output gradients before each update, leaving general capabilities intact and limiting what an attacker can recover. Across biohazardous knowledge and abusive tendencies (WMDP and BeaverTails) and four models spanning dense and Mixture-of-Experts architectures (Llama-3, Qwen-3.5, Gemma-4-E4B, DeepSeek-V2-Lite), RepSelect reduces post-relearning answer probability 4 to 50 times more than the best of five baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL) under both fine-tuning and few-shot attacks, at matched utility retention, demonstrating that selectively targeting representations is essential for robust LLM unlearning.
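Illustrative sketch: the abstract describes collapsing the principal components of activations and output gradients before each update. The NumPy snippet below is one possible reading of that idea for a single linear layer, not the authors' implementation. The function names, the number of collapsed components (`k_act`, `k_grad`), the use of uncentered singular directions as "principal components", and the choice to estimate them from the forget batch itself when no reference data is given are all assumptions made for illustration.

```python
import numpy as np

def top_components(X, k):
    """Top-k right singular directions of X (rows are samples).

    Assumption: uncentered SVD, so the mean direction can be one of them.
    Returns an array of shape (k, d) with orthonormal rows.
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k]

def collapse(X, components):
    """Remove from each row of X its projection onto the given directions."""
    return X - (X @ components.T) @ components

def selective_weight_grad(acts, out_grads, k_act=2, k_grad=2,
                          ref_acts=None, ref_grads=None):
    """Weight gradient of a linear layer y = W x after collapsing the
    dominant directions of its inputs and of the gradients at its outputs.

    acts:      (n, d_in)  layer inputs on the forget batch
    out_grads: (n, d_out) dL/dy on the forget batch
    ref_*:     optional data used to estimate the directions to collapse
               (hypothetical; defaults to the forget batch itself)
    """
    ref_acts = acts if ref_acts is None else ref_acts
    ref_grads = out_grads if ref_grads is None else ref_grads
    a = collapse(acts, top_components(ref_acts, k_act))
    g = collapse(out_grads, top_components(ref_grads, k_grad))
    # Standard linear-layer gradient: sum over samples of outer(g_i, a_i).
    return g.T @ a  # shape (d_out, d_in)

# Toy usage with random data standing in for real activations and gradients.
rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 8))
out_grads = rng.normal(size=(32, 6))
grad_w = selective_weight_grad(acts, out_grads)
```

Under these assumptions, the resulting update has no component along the collapsed input directions, so it only touches what remains after the dominant (presumably shared) structure is removed.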
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 87