Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: LLM Unlearning Evaluation
TL;DR: We propose a novel metric for LLM unlearning that compares models at the distribution level.
Abstract: Evaluating the effectiveness of unlearning in large language models (LLMs) remains a key challenge, especially because existing metrics often rely on specific reference outputs. The widely used *forget quality* metric from the TOFU benchmark compares likelihoods over paraphrased answers, but it is highly sensitive to the choice of reference answers and can therefore obscure whether a model has truly forgotten the targeted information. We argue that unlearning should instead be assessed via distributional equivalence: how closely an unlearned model aligns functionally with the retain-only model, i.e., a model trained without the forget data. To this end, we propose **Functional Alignment for Distributional Equivalence (FADE)**, a novel distribution-level metric that compares two distributions of textual outputs. By comparing model behavior beyond isolated responses, FADE provides a more robust and principled approach to evaluating unlearning.
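The paper defines FADE precisely; the sketch below is not that metric, only a minimal illustration of the general idea of distribution-level comparison: sample generations from the unlearned model and the retain-only model, embed them, and estimate a two-sample discrepancy (here a biased kernel MMD). The `embed` function, kernel bandwidth, and example generations are placeholder assumptions, not anything from the paper.

```python
"""Illustration only: compare two sets of model outputs at the distribution
level with a kernel two-sample statistic (biased MMD^2 estimate).
This is NOT the FADE metric from the paper; the embedding, kernel,
and sample texts are placeholder assumptions."""
import hashlib
import numpy as np


def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy text embedding: hashed bag of character trigrams
    (stand-in for a real sentence encoder)."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def rbf_kernel(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """RBF kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)


def mmd2(outputs_a: list, outputs_b: list) -> float:
    """Biased squared-MMD estimate between two samples of textual outputs;
    values near 0 indicate the two output distributions are close
    (here: unlearned model vs. retain-only model)."""
    A = np.stack([embed(t) for t in outputs_a])
    B = np.stack([embed(t) for t in outputs_b])
    return (rbf_kernel(A, A).mean() + rbf_kernel(B, B).mean()
            - 2.0 * rbf_kernel(A, B).mean())


if __name__ == "__main__":
    # Hypothetical generations on the same forget-set prompts.
    unlearned = ["I don't know who the author is.",
                 "No information is available about that."]
    retain_only = ["I'm not sure who wrote that.",
                   "I have no record of that author."]
    print(f"MMD^2 estimate: {mmd2(unlearned, retain_only):.4f}")
```

In this framing, a reference-free discrepancy over sampled outputs plays the role that per-reference likelihood comparisons play in the TOFU forget-quality metric; the specific embedding and divergence used by FADE should be taken from the paper itself.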
Submission Number: 178