Keywords: LLM Unlearning, Relearning Attack, Privacy Leakage, CKA
Abstract: Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear.
To address this, we propose Prileak, a new evaluation framework that systematically assesses unlearning robustness through three tiers of attack scenarios (direct retrieval, in-context learning recovery, and fine-tuning restoration), combined with quantitative analysis using forgetting scores, association metrics, and forgetting-depth assessment.
Our study exposes significant weaknesses in current unlearning methods and reveals two key findings: 1) unlearning exhibits ripple effects on data that is gradient-associated with the forget set, and 2) most methods achieve only shallow forgetting, failing to remove private information distributed across multiple model layers.
Building on these findings, we propose two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention via layer-progressive learning rates and representational constraints. Together, these strategies move unlearning from shallow to deep forgetting.
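The abstract names association-aware core-set selection based on gradient similarity but does not specify the procedure; the sketch below is a hypothetical illustration (not the paper's implementation) that scores candidate examples by the cosine similarity between their loss gradients and the mean forget-set gradient, then keeps the most associated ones. The helper names and the loss_fn(model, batch) signature are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def example_gradient(model, loss_fn, batch):
    """Flattened loss gradient w.r.t. trainable parameters for one batch.
    (Illustrative; loss_fn(model, batch) -> scalar loss is an assumed interface.)"""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, batch)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads]).detach()

def association_scores(model, loss_fn, forget_batches, candidate_batches):
    """Cosine similarity between each candidate's gradient and the mean
    forget-set gradient; high scores flag gradient-associated data."""
    forget_grad = torch.stack(
        [example_gradient(model, loss_fn, b) for b in forget_batches]
    ).mean(dim=0)
    return [
        F.cosine_similarity(example_gradient(model, loss_fn, b), forget_grad, dim=0).item()
        for b in candidate_batches
    ]

def select_core_set(scores, candidates, k):
    """Keep the k most gradient-associated candidates as the core set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:k]]
```

In practice one would likely restrict the gradient to a subset of layers or use a low-dimensional projection to keep the flattened gradients tractable for large models.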
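The forgetting-depth assessment is likewise only named here, with CKA listed among the keywords; as a minimal sketch under that assumption, layer-wise forgetting depth could be quantified with linear Centered Kernel Alignment (CKA) between the original and unlearned models' hidden representations on forget-set inputs. The threshold and function names below are illustrative, not the paper's metric.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2      # ||Y^T X||_F^2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y + 1e-12)

def forgetting_depth(acts_original, acts_unlearned, threshold=0.9):
    """Illustrative forgetting-depth score: fraction of layers whose
    forget-set representations diverge (CKA below threshold) after unlearning."""
    cka_per_layer = [
        linear_cka(a, b) for a, b in zip(acts_original, acts_unlearned)
    ]
    diverged = sum(c < threshold for c in cka_per_layer)
    return diverged / len(cka_per_layer), cka_per_layer
```

A score near zero would indicate shallow forgetting (only a few layers change), while a score near one would indicate that the forget-set information has been disturbed across most of the network.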
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13739