Keywords: LLM Honesty, LLM Unlearning, Trustworthy AI
TL;DR: We define honesty for LLM unlearning, build a benchmark showing that most methods forget dishonestly, and analyze the reasons.
Abstract: Unlearning in large language models (LLMs) is a critical challenge for ensuring safety and controllability, aiming to remove undesirable data influences from pretrained models while retaining their overall utility. However, existing methods and benchmarks mainly focus on forgetting effectiveness, robustness, and utility, while largely overlooking the honesty of unlearned models. Building on the literature on LLM honesty, we define three key criteria that an honestly unlearned model must satisfy: (1) preserving both utility and honesty on retained knowledge, (2) ensuring effective forgetting, and (3) acknowledging its limitations and responding consistently to questions related to forgotten knowledge. To systematically evaluate the honesty of unlearning, we introduce a suite of metrics covering utility, honesty on the retained set, forgetting effectiveness, rejection rate, and refusal stability in Q&A and MCQ settings. We conduct experiments on 8 representative methods, including feature-randomization-based and gradient-ascent-based methods. We find that most existing unlearning methods fail to meet the standards of honest unlearning, particularly in acknowledging their lack of knowledge and expressing themselves consistently. We also analyze the reasons for these failures from the perspective of entropy and the methods' unlearning modes. Gradient-ascent-based methods appear spuriously good at selecting "I don't know" (IDK), but in fact strongly avoid outputting the answer options (A/B/C/D). Among the studied methods, RMU comes closest to honest unlearning, but it still struggles to express its lack of knowledge and to respond consistently while remaining internally confused.
Submission Number: 44