When Unlearning Backfires: Partial Unlearning Increases PII Regurgitation and enables data extraction in Meta’s Llama 3.2 1B

ICLR 2026 Conference Submission8845 Authors

17 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: LLM Unlearning
TL;DR: Partial unlearning on LLaMA 3.2 1B suppresses the targeted Harry Potter content but increases PII regurgitation under related prompts, revealing a new safety risk that calls for PII-focused evaluations and stronger upstream data hygiene.
Abstract: We study partial unlearning—selectively removing only a subset of a knowledge source—and its safety side effects in LLaMA 3.2 1B. Using a targeted pipeline that unlearns the seven Harry Potter novels while retaining related web content, we find that standard unlearning metrics improve (explicit references drop with minimal utility loss) yet unintended memorization risks worsen: the unlearned model more often regurgitates training snippets containing PII, and this effect amplifies when red-teaming. The result exposes a failure mode where steering away from the removed source shifts the model toward memorized remnants in adjacent data. We argue evaluations of unlearning should include PII-regurgitation stress tests, and that safeguards must prioritize upstream data hygiene, especially for open-source releases. Our findings underscore that an alignment mechanism like unlearning may have unintended side-effects and that there is a need for more rigorous AI safety evaluations.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8845
Loading