Existing Adversarial Large Language Model Unlearning Evaluations Are Inconclusive

27 Jan 2026 (modified: 13 Mar 2026) | Withdrawn by Authors | CC BY 4.0
Abstract: Unlearning seeks to remove sensitive knowledge from large language models, with success often judged through adversarial evaluations. In this work, we critically examine these evaluation practices and reveal key limitations that undermine their reliability. First, we show that adversarial evaluations introduce new information into the model, potentially masking true unlearning performance by re-teaching the model during evaluation. Second, we show that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation methods. Collectively, these issues suggest that existing evaluations risk mischaracterizing unlearning success (or failure). To address this, based on our empirical findings, we propose two principles—*minimal information injection* and *downstream task awareness*—for future evaluations.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Chuan-Sheng_Foo1
Submission Number: 7203