Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: LLM Unlearning Evaluation
TL;DR: We propose a novel metric for LLM unlearning that compares models at the distribution level.
Abstract: Evaluating the effectiveness of unlearning in large language models (LLMs) remains a key challenge, especially because existing metrics often rely on specific reference outputs. The widely used *forget quality* metric from the TOFU benchmark compares likelihoods over paraphrased answers, but it is highly sensitive to the choice of reference answers and can therefore obscure whether a model has truly forgotten the targeted information. We argue that unlearning should instead be assessed via distributional equivalence: how closely an unlearned model aligns functionally with the retain-only model, i.e., a model trained without the forget data. To this end, we propose **Functional Alignment for Distributional Equivalence (FADE)**, a novel distribution-level metric that compares two distributions of textual outputs. By comparing model behavior beyond isolated responses, FADE provides a more robust and principled approach to evaluating unlearning.
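The paper defines FADE precisely; the sketch below is not that metric, only a minimal illustration of the general idea of distribution-level comparison: sample generations from the unlearned model and the retain-only model, embed them, and estimate a two-sample discrepancy (here a biased kernel MMD). The `embed` function, kernel bandwidth, and example generations are placeholder assumptions, not anything from the paper.

```python
"""Illustration only: compare two sets of model outputs at the distribution
level with a kernel two-sample statistic (biased MMD^2 estimate).
This is NOT the FADE metric from the paper; the embedding, kernel,
and sample texts are placeholder assumptions."""
import hashlib
import numpy as np


def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy text embedding: hashed bag of character trigrams
    (stand-in for a real sentence encoder)."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def rbf_kernel(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """RBF kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)


def mmd2(outputs_a: list, outputs_b: list) -> float:
    """Biased squared-MMD estimate between two samples of textual outputs;
    values near 0 indicate the two output distributions are close
    (here: unlearned model vs. retain-only model)."""
    A = np.stack([embed(t) for t in outputs_a])
    B = np.stack([embed(t) for t in outputs_b])
    return (rbf_kernel(A, A).mean() + rbf_kernel(B, B).mean()
            - 2.0 * rbf_kernel(A, B).mean())


if __name__ == "__main__":
    # Hypothetical generations on the same forget-set prompts.
    unlearned = ["I don't know who the author is.",
                 "No information is available about that."]
    retain_only = ["I'm not sure who wrote that.",
                   "I have no record of that author."]
    print(f"MMD^2 estimate: {mmd2(unlearned, retain_only):.4f}")
```

In this framing, a reference-free discrepancy over sampled outputs plays the role that per-reference likelihood comparisons play in the TOFU forget-quality metric; the specific embedding and divergence used by FADE should be taken from the paper itself.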
Submission Number: 178