Keywords: Machine Unlearning, LLM Unlearning
TL;DR: We propose ERASER, a principled unlearning framework that adaptively removes harmful knowledge via subspace nullification and representational disentanglement, achieving state-of-the-art forgetting performance while preserving general capabilities.
Abstract: Large Language Models often retain sensitive or hazardous knowledge that must be suppressed without compromising their general linguistic abilities. However, existing unlearning methods are often unstable and sensitive to hyperparameter choices, and they fail to generalize across knowledge types.
We introduce ERASER, a principled framework for targeted unlearning. It combines a subspace-based target construction with an auxiliary ranking objective that enforces separation between forget and retain domains, thereby achieving stable and effective unlearning.
Beyond existing evaluations, we conduct thorough experiments as follows:
(i) ERASER achieves state-of-the-art unlearning effectiveness while preserving general knowledge on existing benchmark datasets,
(ii) it removes knowledge not only at the surface level but also at deeper semantic and compositional levels, as demonstrated on the Fictional Knowledge dataset,
and (iii) it demonstrates strong robustness against adversarial threats, including jailbreak, membership inference, and relearning attacks.
These results establish ERASER as a practical framework for safe LLM unlearning.
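To make the two ingredients named in the abstract concrete, below is a minimal, hypothetical sketch of what subspace nullification plus an auxiliary ranking-style separation term could look like. It is not the authors' implementation, which is not specified on this page; the tensor names (`forget_hidden`, `retain_hidden`), the subspace rank `k`, and the `margin` value are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): (1) estimate a "forget" subspace
# from hidden states of forget-set examples and project it out, and (2) apply a
# margin-based separation term between forget and retain representations.
import torch
import torch.nn.functional as F

def forget_subspace(forget_hidden: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Top-k right singular vectors of forget-set hidden states, as a (d, k) basis."""
    # forget_hidden: (num_forget_examples, hidden_dim)
    _, _, vh = torch.linalg.svd(forget_hidden, full_matrices=False)
    return vh[:k].T

def nullify(h: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove the component of h lying in the forget subspace (orthogonal projection)."""
    return h - (h @ basis) @ basis.T

def separation_loss(forget_h: torch.Tensor, retain_h: torch.Tensor,
                    margin: float = 1.0) -> torch.Tensor:
    """Hinge-style ranking term: penalize forget/retain representation pairs
    that are closer than `margin`, encouraging the two domains to separate."""
    d = torch.cdist(forget_h, retain_h)   # pairwise Euclidean distances
    return F.relu(margin - d).mean()

# Illustrative usage with random stand-in activations.
forget_hidden = torch.randn(32, 768)
retain_hidden = torch.randn(32, 768)
basis = forget_subspace(forget_hidden, k=8)
cleaned = nullify(forget_hidden, basis)            # forget-subspace component removed
aux_loss = separation_loss(cleaned, retain_hidden)  # auxiliary ranking-style term
```

In this sketch the nullified representations would serve as targets for the unlearned model, while the auxiliary term keeps retain-domain representations away from the forget domain; how ERASER actually combines and weights these objectives is detailed in the paper itself.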
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18887