Abstract: Language models trained on web-scale corpora risk memorizing and exposing sensitive information, prompting the need for effective machine unlearning methods.
Prior unlearning methods, ranging from blocking sensitive input queries to modifying model parameters, often fail to prevent leakage in generated responses and risk unintentionally forgetting important general knowledge (i.e., catastrophic forgetting).
To address these limitations, we propose Corrective Unlearning with Retrieved Exclusions (CURE), a response-level unlearning framework that identifies and edits leaked content in model outputs without updating the original model.
Specifically, CURE employs a corrector that flags and revises unwanted content, using unlearning contexts provided as in-context examples for leakage detection.
To efficiently handle large-scale unlearning requests, we integrate retrieval augmentation that dynamically selects relevant unlearning samples based on the model's initial output, effectively reducing the context length required for correction (sketched below).
Extensive evaluations show that CURE significantly reduces response-level leakage while preserving model utility, maintaining robust performance even under continual unlearning setups.
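As a concrete illustration, the sketch below mocks up the two stages the abstract describes: a retriever that selects the unlearning samples most similar to the model's draft output, and a correction prompt that presents them as in-context examples for the corrector. Everything here is a hypothetical placeholder, not the paper's actual implementation: the names (`UNLEARNING_STORE`, `retrieve_unlearning_samples`, `build_correction_prompt`), the toy data, and the bag-of-words cosine retriever are all assumptions for illustration only.

```python
# Hypothetical sketch of a CURE-style retrieve-then-correct pipeline.
# All names, data, and the toy lexical retriever are placeholders;
# the paper's actual components are not specified in this abstract.
from collections import Counter
import math

# Hypothetical store of unlearning requests: (leaked text, safe revision).
UNLEARNING_STORE = [
    ("Alice's home address is 42 Elm St.", "Alice's home address is private."),
    ("Bob's SSN is 123-45-6789.", "Bob's SSN cannot be shared."),
]

def _tf(text):
    """Toy term-frequency vector over lowercase whitespace tokens."""
    return Counter(text.lower().split())

def _cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve_unlearning_samples(initial_output, k=1):
    """Select the k unlearning samples most similar to the model's draft,
    so the correction context stays small as the store grows."""
    q = _tf(initial_output)
    ranked = sorted(UNLEARNING_STORE,
                    key=lambda s: _cosine(q, _tf(s[0])),
                    reverse=True)
    return ranked[:k]

def build_correction_prompt(initial_output, samples):
    """Format retrieved samples as in-context examples for the corrector."""
    demos = "\n".join(f"Leaked: {leak}\nRevised: {fix}"
                      for leak, fix in samples)
    return ("Revise the response so it does not reveal the leaked content.\n"
            f"{demos}\n"
            f"Response: {initial_output}\nRevised:")

if __name__ == "__main__":
    draft = "Sure! Alice's home address is 42 Elm St."
    samples = retrieve_unlearning_samples(draft, k=1)
    # In practice, this prompt would be sent to the corrector LM.
    print(build_correction_prompt(draft, samples))
```

In a real system the lexical retriever would presumably be replaced by a dense encoder and the prompt sent to the corrector model; the point of the sketch is only that retrieval keeps the in-context demonstration set small no matter how many unlearning requests accumulate.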
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: security and privacy, safety and alignment, retrieval-augmented generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7374