Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check

Published: 11 Jun 2025, Last Modified: 19 Jun 2025
MUGen @ ICML 2025 Poster
License: CC BY 4.0
Keywords: LLM Unlearning Evaluation, Probabilistic Precision and Recall
TL;DR: We propose a novel metric for LLM unlearning that compares models at the distribution level.
Abstract:

Evaluating the effectiveness of unlearning in large language models (LLMs) remains a key challenge, especially as existing metrics often rely on specific reference outputs. The widely used forget quality metric from the TOFU benchmark compares likelihoods over paraphrased answers but is highly sensitive to the choice of these references, potentially obscuring whether a model has truly forgotten the targeted information. We argue that unlearning should instead be assessed via distributional equivalence---how closely an unlearned model aligns functionally with the retain-only model. To this end, we propose Functional Alignment for Distributional Equivalence (FADE), a novel distribution-level metric that measures probabilistic precision and recall between model outputs. FADE provides a more robust, principled approach to evaluating unlearning by comparing model behavior beyond isolated responses.
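The distribution-level precision/recall idea sketched in the abstract can be illustrated with a small numerical toy example. The snippet below is a hypothetical sketch, not the paper's actual FADE formulation: the two models are stand-ins represented as categorical distributions over a handful of candidate answers, and the sampler, support threshold, and coverage function are assumptions chosen purely for illustration.

```python
# Hypothetical sketch (assumed, not the paper's exact FADE definition): a
# distribution-level precision/recall comparison between an "unlearned" model
# and a "retain-only" reference model, each reduced here to a toy categorical
# distribution over a shared set of candidate answers.
import numpy as np

rng = np.random.default_rng(0)

# Toy output distributions over 5 candidate answers (illustrative values only).
p_unlearned = np.array([0.40, 0.30, 0.15, 0.10, 0.05])  # model after unlearning
p_retain    = np.array([0.35, 0.35, 0.20, 0.05, 0.05])  # model retrained without the forget set

def mc_coverage(p_sampler, p_scorer, n_samples=10_000, threshold=0.01):
    """Monte Carlo estimate of how much of `p_sampler`'s mass lands in the
    high-probability support of `p_scorer` (a precision/recall-style quantity)."""
    samples = rng.choice(len(p_sampler), size=n_samples, p=p_sampler)
    return float(np.mean(p_scorer[samples] >= threshold))

# "Precision": outputs sampled from the unlearned model that the retain model also supports.
precision = mc_coverage(p_unlearned, p_retain)
# "Recall": outputs sampled from the retain model that the unlearned model also supports.
recall = mc_coverage(p_retain, p_unlearned)

print(f"precision ~ {precision:.3f}, recall ~ {recall:.3f}")
```

In this toy setting, high precision and high recall together indicate that the unlearned model's output distribution functionally matches the retain-only reference, rather than merely assigning low likelihood to a fixed set of paraphrased answers.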

Submission Number: 16