Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs

Published: 11 Jun 2025 · Last Modified: 11 Jun 2025 · MUGen @ ICML 2025 (Oral) · License: CC BY 4.0
Keywords: Large Language Model; Machine Unlearning; LLM Safety
Abstract: Machine unlearning (MU) for large language models (LLMs) removes unwanted knowledge from a pre-trained model while preserving its utility on unrelated tasks. Despite unlearning’s benefits for privacy, copyright, and harm mitigation, we identify a new post-unlearning issue: unlearning trace detection. We show that unlearning leaves a persistent "fingerprint" in model behavior that a classifier can detect from outputs, even on forget-irrelevant inputs. A simple supervised classifier distinguishes original and unlearned models with high accuracy using only their text outputs. Analysis reveals that unlearning traces are embedded in intermediate activations and propagate to final outputs, lying on low-dimensional manifolds that classifiers can learn. We achieve over 90% accuracy on forget-related prompts and up to 94% on forget-irrelevant queries for our largest LLM, demonstrating the broad applicability of trace detection. These findings show that unlearning leaves measurable signatures, undermining privacy guarantees and enabling reverse-engineering of removed knowledge.
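To make the detection setup concrete, below is a minimal sketch of an output-based unlearning-trace detector, assuming one already has text generations from an original model and its unlearned counterpart on a shared prompt set. The function name `train_trace_detector` and the TF-IDF + logistic-regression baseline are illustrative assumptions, not the paper's actual classifier, which may use a stronger learned model.

```python
# Hypothetical sketch: train a binary classifier to tell original-model outputs
# apart from unlearned-model outputs, mirroring the supervised detection setup
# described in the abstract. Uses a simple TF-IDF + logistic-regression baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def train_trace_detector(original_outputs, unlearned_outputs):
    """Fit a classifier with label 0 = original model, 1 = unlearned model."""
    texts = list(original_outputs) + list(unlearned_outputs)
    labels = [0] * len(original_outputs) + [1] * len(unlearned_outputs)

    # Hold out a test split to estimate detection accuracy on unseen outputs.
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0
    )

    detector = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # unigram/bigram features
        LogisticRegression(max_iter=1000),
    )
    detector.fit(x_train, y_train)

    acc = accuracy_score(y_test, detector.predict(x_test))
    return detector, acc


# Usage (with hypothetical lists of generated texts):
# detector, acc = train_trace_detector(orig_texts, unlearned_texts)
# print(f"Held-out trace-detection accuracy: {acc:.2%}")
```

Any accuracy substantially above 50% on held-out outputs, especially for forget-irrelevant prompts, would indicate the kind of persistent behavioral fingerprint the paper describes.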
Submission Number: 43