Keywords: LLM, Unlearning, AI safety
Abstract: Unlearning seeks to remove sensitive knowledge from large language models, with success often judged through adversarial evaluations. In this work, we critically examine these evaluation practices and reveal key limitations that undermine their reliability. First, we show that adversarial evaluations introduce new information into the model, potentially masking true unlearning performance by re-teaching the model during evaluation. Second, we show that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation methods. Collectively, these issues suggest that existing evaluations risk mischaracterizing unlearning success (or failure). To address this, we draw on our empirical findings to propose two principles for future evaluations: *minimal information injection* and *downstream task awareness*.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13103