Keywords: Class Unlearning, Machine Unlearning
Abstract: Class unlearning seeks to remove the influence of designated training classes while retaining utility on the remaining ones, often for privacy or regulatory compliance. Existing evaluations largely declare success once the forgotten classes exhibit near-zero accuracy or fail membership inference tests. We argue this view is incomplete and introduce the notion of *the illusion of forgetting*: even when accuracy appears suppressed, the black-box outputs of unlearned models can retain residual, recoverable signals about forgotten classes. We formalize this phenomenon by quantifying residual information in the output space and show that unlearning trajectories leave statistically distinguishable signatures. To demonstrate practical implications, we propose a simple yet effective post-hoc recovery framework, which amplifies weak signals using a Yeo–Johnson transformation and adapts decision thresholds to reconstruct predictions for forgotten classes. Across 12 unlearning algorithms and 4 benchmark datasets, our framework substantially restores forgotten-class accuracy while causing minimal degradation on retained classes. These findings (i) expose critical blind spots in current unlearning evaluations and (ii) provide the first systematic evidence that forgotten-class utility can be restored from black-box access alone.
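To make the recovery idea concrete, below is a minimal, hypothetical sketch of the kind of post-hoc procedure the abstract describes: applying a Yeo–Johnson transformation to the forgotten-class scores in the unlearned model's black-box outputs and then re-thresholding predictions. The function name `recover_forgotten_predictions` and the `threshold_quantile` parameter are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import yeojohnson


def recover_forgotten_predictions(probs, forgotten_class, threshold_quantile=0.9):
    """Hypothetical sketch of post-hoc recovery from black-box outputs.

    probs: (N, C) array of softmax outputs from the unlearned model.
    forgotten_class: index of the class targeted by unlearning.
    threshold_quantile: quantile used to set the adaptive decision threshold.
    """
    # Yeo-Johnson handles zero and near-zero values (unlike Box-Cox), so it can
    # be applied directly to the suppressed forgotten-class probabilities.
    scores = probs[:, forgotten_class]
    amplified, _ = yeojohnson(scores)  # lambda is fitted from the data itself

    # Adaptive threshold: samples whose amplified score exceeds the chosen
    # quantile are re-assigned to the forgotten class; all others keep the
    # unlearned model's original argmax prediction.
    tau = np.quantile(amplified, threshold_quantile)
    preds = probs.argmax(axis=1)
    preds[amplified > tau] = forgotten_class
    return preds
```

Under these assumptions, retained-class predictions are left untouched unless a sample's amplified forgotten-class score crosses the adaptive threshold, which is consistent with the abstract's claim of minimal degradation on retained classes.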
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 1374