Evaluating the effectiveness of unlearning in large language models (LLMs) remains a key challenge, especially as existing metrics often rely on specific reference outputs. The widely used forget quality metric from the TOFU benchmark compares likelihoods over paraphrased answers but is highly sensitive to the choice of these references, potentially obscuring whether a model has truly forgotten the targeted information. We argue that unlearning should instead be assessed via distributional equivalence: how closely an unlearned model aligns functionally with the retain-only model. To this end, we propose Functional Alignment for Distributional Equivalence (FADE), a novel distribution-level metric that measures probabilistic precision and recall between the output distributions of the unlearned and retain-only models. FADE provides a more robust, principled approach to evaluating unlearning by comparing model behavior beyond isolated responses.
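To make the idea of distribution-level precision and recall concrete, the following is a minimal, illustrative sketch only; it is not the paper's FADE definition, which the abstract does not spell out. It uses toy next-token distributions (`p_unlearned`, `q_retain` are hypothetical stand-ins for model output probabilities) and maps forward and reverse divergences into precision- and recall-like scores in (0, 1].

```python
# Illustrative sketch, not the FADE metric itself: a generic way to turn
# forward/reverse divergences between two output distributions into
# precision- and recall-style scores. `p_unlearned` and `q_retain` are
# hypothetical toy distributions standing in for model output probabilities.
import numpy as np

def precision_recall(p_unlearned, q_retain, eps=1e-12):
    """Precision: does the unlearned model place mass only where the
    retain-only model does (reverse-KL flavour)?
    Recall: does the unlearned model cover everything the retain-only
    model generates (forward-KL flavour)?"""
    p = np.asarray(p_unlearned, dtype=float)
    q = np.asarray(q_retain, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    # Exponentiating the negative divergence maps it into (0, 1],
    # with 1 meaning the two distributions coincide.
    precision = np.exp(-(p * (np.log(p + eps) - np.log(q + eps))).sum())
    recall = np.exp(-(q * (np.log(q + eps) - np.log(p + eps))).sum())
    return precision, recall

# Toy usage: identical distributions give precision = recall = 1.0.
print(precision_recall([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))
```

In this toy view, a model that has "forgotten" by collapsing onto a few safe outputs would score high on precision but low on recall, while one that still leaks forgotten content would lose precision; the abstract's point is that both directions matter when comparing against the retain-only model.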