Abstract: Large language models (LLMs) trained on extensive corpora risk memorizing sensitive, copyrighted, or toxic content.
To mitigate this, we propose OBLIVIATE, a robust and practical unlearning framework that can remove targeted data while preserving model utility.
It follows a structured process: extracting target tokens and building retain sets from the forget sets, then fine-tuning with a tailored loss that combines three components: mask, distillation, and world fact.
With low-rank adapters (LoRA), our approach ensures efficiency without compromising unlearning quality.
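To make the loss decomposition concrete, below is a minimal PyTorch sketch of how the three terms might be combined. The function name `obliviate_loss`, the weights `w_mask`/`w_distill`/`w_fact`, and the exact form of each term are illustrative assumptions, not the paper's published formulation.

```python
import torch
import torch.nn.functional as F

def obliviate_loss(forget_logits, teacher_logits, forget_labels, target_mask,
                   fact_logits, fact_labels,
                   w_mask=1.0, w_distill=1.0, w_fact=1.0):
    """Illustrative composite of the mask, distillation, and world-fact terms.
    Tensor shapes: (batch, seq, vocab) for logits, (batch, seq) for labels/mask."""
    log_probs = F.log_softmax(forget_logits, dim=-1)

    # 1) Mask term (assumed form): penalize probability mass the model still
    #    places on extracted target tokens (target_mask == 1 marks tokens to forget).
    tok_logp = log_probs.gather(-1, forget_labels.unsqueeze(-1)).squeeze(-1)
    mask_loss = (tok_logp * target_mask).sum() / target_mask.sum().clamp(min=1)

    # 2) Distillation term (assumed form): keep the remaining (retained) tokens
    #    close to the frozen original model via KL divergence.
    retain = 1.0 - target_mask
    kl = F.kl_div(log_probs, F.softmax(teacher_logits, dim=-1),
                  reduction="none").sum(-1)
    distill_loss = (kl * retain).sum() / retain.sum().clamp(min=1)

    # 3) World-fact term (assumed form): ordinary next-token cross-entropy on a
    #    separate batch of general factual text, preserving broad model utility.
    fact_loss = F.cross_entropy(fact_logits.view(-1, fact_logits.size(-1)),
                                fact_labels.view(-1))

    return w_mask * mask_loss + w_distill * distill_loss + w_fact * fact_loss
```

In a LoRA-based setup, only the low-rank adapter parameters (e.g., attached via the `peft` library) would receive gradients during this fine-tuning, which is what keeps the update lightweight.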
We evaluate OBLIVIATE across multiple datasets, including the Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (with a new document-level memorization score), model utility, and fluency.
Results demonstrate its effectiveness in resisting membership inference attacks, minimizing impacts on retained data, and maintaining robustness across diverse scenarios.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Unlearning, Large Language Models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings-efficiency, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 2727