Keywords: LLM Unlearning, Relearning Attack, Privacy Leakage, CKA
Abstract: Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear.
To address this, we propose Prileak, a new evaluation framework that systematically assesses unlearning robustness through three tiers of attack scenarios (direct retrieval, in-context learning recovery, and fine-tuning restoration), combined with quantitative analysis using forgetting scores, association metrics, and forgetting-depth assessment.
Our study exposes significant weaknesses in current unlearning methods and reveals two key findings: 1) unlearning exhibits ripple effects on data that is gradient-associated with the forget set, and 2) most methods achieve only shallow forgetting, failing to remove private information distributed across multiple model layers.
Building on these findings, we propose two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention via layer-progressive learning rates and representational constraints. Together, these strategies move unlearning from shallow to deep forgetting.
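The abstract names association-aware core-set selection based on gradient similarity but does not specify the procedure; the sketch below is a hypothetical illustration (not the paper's implementation) that scores candidate examples by the cosine similarity between their loss gradients and the mean forget-set gradient, then keeps the most associated ones. The helper names and the loss_fn(model, batch) signature are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def example_gradient(model, loss_fn, batch):
    """Flattened loss gradient w.r.t. trainable parameters for one batch.
    (Illustrative; loss_fn(model, batch) -> scalar loss is an assumed interface.)"""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, batch)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads]).detach()

def association_scores(model, loss_fn, forget_batches, candidate_batches):
    """Cosine similarity between each candidate's gradient and the mean
    forget-set gradient; high scores flag gradient-associated data."""
    forget_grad = torch.stack(
        [example_gradient(model, loss_fn, b) for b in forget_batches]
    ).mean(dim=0)
    return [
        F.cosine_similarity(example_gradient(model, loss_fn, b), forget_grad, dim=0).item()
        for b in candidate_batches
    ]

def select_core_set(scores, candidates, k):
    """Keep the k most gradient-associated candidates as the core set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:k]]
```

In practice one would likely restrict the gradient to a subset of layers or use a low-dimensional projection to keep the flattened gradients tractable for large models.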
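The forgetting-depth assessment is likewise only named here, with CKA listed among the keywords; as a minimal sketch under that assumption, layer-wise forgetting depth could be quantified with linear Centered Kernel Alignment (CKA) between the original and unlearned models' hidden representations on forget-set inputs. The threshold and function names below are illustrative, not the paper's metric.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2      # ||Y^T X||_F^2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y + 1e-12)

def forgetting_depth(acts_original, acts_unlearned, threshold=0.9):
    """Illustrative forgetting-depth score: fraction of layers whose
    forget-set representations diverge (CKA below threshold) after unlearning."""
    cka_per_layer = [
        linear_cka(a, b) for a, b in zip(acts_original, acts_unlearned)
    ]
    diverged = sum(c < threshold for c in cka_per_layer)
    return diverged / len(cka_per_layer), cka_per_layer
```

A score near zero would indicate shallow forgetting (only a few layers change), while a score near one would indicate that the forget-set information has been disturbed across most of the network.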
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13739