Mirror Mirror on the Wall, Have I Forgotten it All? A New Framework for Evaluating Machine Unlearning

14 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: machine learning, machine unlearning, indistinguishability, alignment, cryptography
TL;DR: We show that machine unlearning techniques fail to be indistinguishable from a control; we propose a strong formal definition for machine unlearning and prove feasibility results.
Abstract: Machine unlearning methods take a model trained on a dataset $\mathcal{D}$ and a forget set $\mathcal{D}_f$, and attempt to produce a model as if it had been trained only on $\mathcal{D} \setminus \mathcal{D}_f$. We empirically show that, across representative unlearning methods from the literature, an adversary is able to distinguish between a mirror model (a control model produced by retraining without the data to forget) and a model produced by an unlearning method. Our distinguishing algorithms are based on evaluation scores from the literature (i.e., membership inference scores) and on Kullback-Leibler divergence. We then propose a strong formal definition for machine unlearning called computational unlearning: the inability of an adversary to distinguish between a mirror model and a model produced by an unlearning method. This definition allows us to prove feasibility results and to demonstrate that current methodology in the literature -- such as differential privacy -- fundamentally falls short of achieving computational unlearning. We leave achieving practical computational unlearning to future work.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5284