Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs

Published: 11 Jun 2025 · Last Modified: 11 Jun 2025 · MUGen @ ICML 2025 (Oral) · License: CC BY 4.0
Keywords: Large Language Model; Machine Unlearning; LLM Safety
Abstract: Machine unlearning (MU) for large language models (LLMs) removes unwanted knowledge from a pre-trained model while preserving its utility on unrelated tasks. Despite unlearning’s benefits for privacy, copyright, and harm mitigation, we identify a new post-unlearning issue: unlearning trace detection. We show that unlearning leaves a persistent "fingerprint" in model behavior that a classifier can detect from outputs, even on forget-irrelevant inputs. A simple supervised classifier distinguishes original and unlearned models with high accuracy using only their text outputs. Analysis reveals that unlearning traces are embedded in intermediate activations and propagate to final outputs, lying on low-dimensional manifolds that classifiers can learn. We achieve over 90% accuracy on forget-related prompts and up to 94% on forget-irrelevant queries for our largest LLM, demonstrating the broad applicability of trace detection. These findings show that unlearning leaves measurable signatures, undermining privacy guarantees and enabling reverse-engineering of removed knowledge.
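To make the detection setup concrete, below is a minimal sketch of an output-based unlearning-trace detector, assuming one already has text generations from an original model and its unlearned counterpart on a shared prompt set. The function name `train_trace_detector` and the TF-IDF + logistic-regression baseline are illustrative assumptions, not the paper's actual classifier, which may use a stronger learned model.

```python
# Hypothetical sketch: train a binary classifier to tell original-model outputs
# apart from unlearned-model outputs, mirroring the supervised detection setup
# described in the abstract. Uses a simple TF-IDF + logistic-regression baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def train_trace_detector(original_outputs, unlearned_outputs):
    """Fit a classifier with label 0 = original model, 1 = unlearned model."""
    texts = list(original_outputs) + list(unlearned_outputs)
    labels = [0] * len(original_outputs) + [1] * len(unlearned_outputs)

    # Hold out a test split to estimate detection accuracy on unseen outputs.
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0
    )

    detector = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # unigram/bigram features
        LogisticRegression(max_iter=1000),
    )
    detector.fit(x_train, y_train)

    acc = accuracy_score(y_test, detector.predict(x_test))
    return detector, acc


# Usage (with hypothetical lists of generated texts):
# detector, acc = train_trace_detector(orig_texts, unlearned_texts)
# print(f"Held-out trace-detection accuracy: {acc:.2%}")
```

Any accuracy substantially above 50% on held-out outputs, especially for forget-irrelevant prompts, would indicate the kind of persistent behavioral fingerprint the paper describes.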
Submission Number: 43