RESTOR: Knowledge Recovery via Machine Unlearning

TMLR Paper 4270 Authors

20 Feb 2025 (modified: 25 Feb 2025), Under review for TMLR, CC BY 4.0
Abstract: Large language models trained on web-scale corpora can memorize undesirable datapoints containing incorrect facts, copyrighted content, or sensitive data. Recently, many machine unlearning algorithms have been proposed that aim to 'erase' the effect of these datapoints from trained models, that is, to revert model behavior so that it resembles a model that had never been trained on these datapoints in the first place. However, evaluating the success of unlearning algorithms remains an open challenge. Previous work has relied on heuristics, such as verifying that the model can no longer reproduce the specific information targeted for removal while maintaining accuracy on unrelated test data, but these approaches fall short of capturing the full effect of erasing a datapoint. In this work, we propose the RESTOR framework for machine unlearning, which evaluates how well unlearning algorithms perform targeted data erasure: whether the model forgets the knowledge introduced by these datapoints while simultaneously recovering the knowledge state it would have had if it had never encountered them. RESTOR helps uncover several novel insights about popular unlearning algorithms and the mechanisms through which they operate, for instance, identifying that some algorithms merely emphasize forgetting but not recovering knowledge, and that localizing unlearning targets can enhance unlearning performance.
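The distinction between forgetting and recovery can be made concrete with a small sketch. The snippet below is an illustrative Python example of the underlying idea, not the paper's actual metrics; the function names, inputs, and normalization choices are assumptions introduced only for exposition.

```python
# Illustrative sketch (assumed, not the paper's API): scoring an unlearning run
# against both the corrupted model and a clean reference model that never saw
# the corrupting datapoints.

def forgetting_score(wrong_rate_corrupted: float, wrong_rate_unlearned: float) -> float:
    """Fraction of corruption-induced wrong answers removed by unlearning.

    wrong_rate_* -- rate at which the model reproduces the injected incorrect
    facts, measured on queries affected by the corrupting datapoints.
    """
    if wrong_rate_corrupted == 0:
        return 1.0  # nothing was memorized, so nothing is left to forget
    return 1.0 - wrong_rate_unlearned / wrong_rate_corrupted


def recovery_score(acc_clean: float, acc_corrupted: float, acc_unlearned: float) -> float:
    """How far unlearning moves accuracy on the affected facts back toward
    the clean reference model's accuracy.

    acc_clean     -- accuracy of a model never trained on the corrupting data
    acc_corrupted -- accuracy after training on the corrupting data
    acc_unlearned -- accuracy after running the unlearning algorithm
    """
    gap = acc_clean - acc_corrupted
    if gap <= 0:
        return 1.0  # corruption did not hurt accuracy; nothing to recover
    return (acc_unlearned - acc_corrupted) / gap


if __name__ == "__main__":
    # Hypothetical numbers: an algorithm that forgets well but recovers poorly.
    print(forgetting_score(wrong_rate_corrupted=0.80, wrong_rate_unlearned=0.05))  # ~0.94
    print(recovery_score(acc_clean=0.90, acc_corrupted=0.40, acc_unlearned=0.50))  # 0.20
```

Under this kind of scoring, an algorithm can look successful on a forgetting-only heuristic (the model no longer emits the injected facts) while still failing to restore the knowledge state of a model that never saw the corrupting data, which is the gap RESTOR is designed to expose.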
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Tongliang_Liu1
Submission Number: 4270