Keywords: LLMs, Unlearning
TL;DR: We develop a new unlearning method that is robust to benign relearning
Abstract: Unlearning in LLMs aims to precisely remove specific information from a model, for instance to improve safety by deleting private or harmful data. We argue that many current methods underperform due to the problematic loss functions used during extended fine-tuning. To address this, we introduce JensUn, a method that uses the Jensen-Shannon Divergence as its training objective. By setting specific targets for the "forget" and "retain" sets, JensUn achieves a better trade-off between unlearning quality and model utility and is more resilient to benign relearning. For a more precise evaluation, we create LKF, a new dataset of lesser-known facts that are known to the pre-trained model. Building on this, we propose a more robust evaluation framework: \textit{(i)} using an LLM as a semantic judge instead of the standard ROUGE score, and \textit{(ii)} conducting a worst-case unlearning evaluation over various paraphrased queries and input formats. Under this improved framework, many existing methods turn out to be less effective than previously believed.
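For concreteness, the objective described in the abstract can be read as a minimal sketch of the following form, where the forget-set target $t_{\text{forget}}$ (e.g., the distribution of a fixed uninformative answer), the retain-set target $t_{\text{retain}}$ (e.g., the original model's predictions), and the weighting $\lambda$ are our assumptions for illustration, not details given in the abstract:
% Sketch only: targets t_forget, t_retain and weight \lambda are assumed, not specified above.
\[
\mathcal{L}(\theta)
= \mathbb{E}_{x \in \mathcal{D}_{\text{forget}}}\,
  \mathrm{JSD}\!\left(p_\theta(\cdot \mid x) \,\|\, t_{\text{forget}}\right)
+ \lambda\, \mathbb{E}_{x \in \mathcal{D}_{\text{retain}}}\,
  \mathrm{JSD}\!\left(p_\theta(\cdot \mid x) \,\|\, t_{\text{retain}}\right),
\]
\[
\mathrm{JSD}(P \,\|\, Q)
= \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M)
+ \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q).
\]
Unlike the KL divergence, the JSD is symmetric and bounded, which is one plausible reason a JSD objective could yield more stable fine-tuning than the loss functions the abstract criticizes.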
Serve As Reviewer: ~Maximilian_Müller1
Submission Number: 41