CRU: Layer-Targeted Contrastive Representation Unlearning for Selective LLM Forgetting

ACL ARR 2026 January Submission5145 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Layer-Targeted Unlearning, Machine Unlearning for LLMs, Selective Knowledge Removal
Abstract: Machine unlearning for large language models (LLMs) aims to remove undesirable content without retraining from scratch, but target concepts are often distributed across layers, making edits either overly destructive or insufficiently effective. We propose Contrastive Representation Unlearning (CRU), a layer-targeted framework that edits only a small set of concept-bearing layers. CRU first localizes memory layers via activation-based significance on the forget set, then applies a compact representation-level objective that (1) anchors retain representations to the original model, (2) pushes forget representations toward sample-specific neutral targets, and (3) increases the margin between forget and retain representations. Experiments on WMDP and MUSE with three 7B LLMs show that CRU improves forgetting over strong global-editing baselines while largely preserving general utility on MMLU. These results suggest that precise layer localization and representation-level constraints enable efficient and reliable targeted unlearning.
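The three-part objective described in the abstract (anchoring retain representations, pushing forget representations toward neutral targets, and enlarging the forget–retain margin) can be sketched as a simple loss over hidden states. This is a minimal illustration only: the function name, the squared-distance/hinge formulation, and the weighting terms are assumptions, not the paper's actual implementation.

```python
import numpy as np

def cru_loss(h_retain, h_retain_orig, h_forget, h_neutral,
             margin=1.0, lam_anchor=1.0, lam_forget=1.0, lam_margin=1.0):
    """Illustrative sketch of a CRU-style representation objective.

    h_retain      : edited-model representations of retain-set samples
    h_retain_orig : original-model representations of the same samples
    h_forget      : edited-model representations of forget-set samples
    h_neutral     : sample-specific neutral targets for the forget set
    All arrays have shape (batch, hidden_dim). Weights and margin are
    hypothetical hyperparameters chosen for this sketch.
    """
    # (1) anchor retain representations to the original model
    anchor = np.mean(np.sum((h_retain - h_retain_orig) ** 2, axis=-1))
    # (2) push forget representations toward their neutral targets
    forget = np.mean(np.sum((h_forget - h_neutral) ** 2, axis=-1))
    # (3) hinge term enlarging the forget-retain separation
    dist = np.linalg.norm(h_forget - h_retain, axis=-1)
    margin_term = np.mean(np.maximum(0.0, margin - dist))
    return lam_anchor * anchor + lam_forget * forget + lam_margin * margin_term
```

In practice such a loss would be applied only at the small set of concept-bearing layers selected by the activation-based significance step, with gradients flowing only through those layers' parameters.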
Paper Type: Long
Research Area: Retrieval-Augmented Language Models
Research Area Keywords: Safety and Alignment in LLMs, Machine Unlearning
Languages Studied: English
Submission Number: 5145