Corrective Unlearning: Scalable and Robust Knowledge Removal via Output Correction

ACL ARR 2026 January Submission 7862 Authors

06 Jan 2026 (modified: 20 Mar 2026) · Readers: Everyone · License: CC BY 4.0
Keywords: unlearning, safety, language model
Abstract: Language models trained on web-scale data risk memorizing and exposing sensitive information, yet existing unlearning methods struggle to balance safety, utility, and scalability. Prior approaches based on fine-tuning or input guardrails often degrade model performance, remain vulnerable to indirect probing, and scale poorly to continual unlearning. We propose corrective unlearning, a novel paradigm that achieves effective and scalable unlearning through output correction. Our framework, CURE, employs a lightweight corrector to detect and rewrite potential leakage in initial model drafts, leveraging retrieved unlearning targets as negative in-context references. Extensive evaluations show that CURE substantially reduces information leakage, even under indirect queries where prior methods fail, while preserving response quality and model utility. Moreover, CURE remains robust in continual and out-of-distribution unlearning scenarios, making it practical for real-world deployment.
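The abstract outlines the CURE pipeline at a high level: a base model drafts a response, relevant unlearning targets are retrieved, and a lightweight corrector rewrites any leakage using those targets as negative in-context references. A minimal sketch of that flow, under stated assumptions, follows; the interface names (Generator, TargetStore, cure_generate) and the correction prompt wording are illustrative guesses, not the authors' actual implementation.

from typing import List, Protocol


class Generator(Protocol):
    """Any text generator: the base model or the lightweight corrector."""

    def generate(self, prompt: str) -> str: ...


class TargetStore(Protocol):
    """Retrieval index over the unlearning targets (protected content)."""

    def retrieve(self, query: str, top_k: int) -> List[str]: ...


def cure_generate(
    query: str,
    draft_model: Generator,
    corrector: Generator,
    targets: TargetStore,
    k: int = 3,
) -> str:
    """Hypothetical corrective-unlearning generation loop (assumed, not the paper's code)."""
    # Step 1: the base model produces an initial draft as usual.
    draft = draft_model.generate(query)

    # Step 2: retrieve unlearning targets relevant to this query; they act as
    # negative in-context references for the corrector.
    refs = targets.retrieve(query, top_k=k)
    if not refs:
        return draft  # nothing protected is implicated by this query

    # Step 3: the corrector inspects the draft against the retrieved targets
    # and rewrites any span that would leak protected information.
    correction_prompt = (
        "The draft answer below may leak protected content.\n"
        "Protected content (reveal none of it):\n"
        + "\n".join(f"- {r}" for r in refs)
        + f"\n\nDraft:\n{draft}\n\n"
        "Rewrite the draft so it conveys no protected content "
        "while keeping the rest of the answer intact."
    )
    return corrector.generate(correction_prompt)

Because the correction happens at the output stage rather than via weight updates, adding a new unlearning target in this sketch only requires inserting it into the retrieval store, which is consistent with the continual-unlearning claim in the abstract.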
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, security and privacy
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 7862