Keywords: Large Language Models, Machine Unlearning, Safety and Alignment, Reasoning-Based Leakage
Abstract: Large language models (LLMs) increasingly need mechanisms to remove specific information, motivated by privacy regulations, content-removal requests, and alignment with evolving norms. Unlearning methods aim to erase targeted knowledge while preserving overall utility, yet it is unclear whether these methods truly delete information or merely suppress it. We study a failure mode in which erased knowledge re-emerges under step-by-step prompting, a phenomenon we term \textit{reasoning-based leakage}. We introduce Sleek, a black-box diagnostic framework that probes unlearned models with multi-hop reasoning queries. Sleek synthesizes structured prompts, classifies leaked knowledge as \textit{direct}, \textit{indirect}, or \textit{implied}, and evaluates both incomplete forgetting of targeted facts and unintended suppression of retained knowledge. Across four representative unlearning techniques and two open-weight LLMs, Sleek reveals systematic leakage: erased facts are recoverable in up to 62.5\% of cases, and collateral forgetting occurs in 50\% of retained queries. Rather than proposing a new unlearning algorithm, this work offers an evaluation perspective that shows why suppression persists and how leakage arises through reasoning. Sleek provides a practical tool for comparing (model, unlearning method) configurations and highlights a broader challenge: reliable, verifiable forgetting remains unsolved for safety-critical LLM deployments.
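To make the described evaluation loop concrete, the following is a minimal sketch of a Sleek-style leakage probe. Everything in it is an illustrative assumption rather than the paper's actual implementation: the `query_model` callable, the `Probe` structure, and the keyword heuristic stand in for Sleek's prompt synthesis and response classification, and the paper's \textit{implied} category (leakage recoverable only by inference) would require a semantic judge that this sketch omits.

```python
# Minimal sketch of a reasoning-based leakage probe (illustrative only).
# Assumed names: query_model (a prompt -> completion callable), Probe, and
# the keyword heuristic below; none of these come from the paper itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    fact: str       # the fact the model was asked to unlearn
    direct: str     # direct recall query for that fact
    multi_hop: str  # step-by-step query that reaches the fact indirectly

def classify_leakage(answer: str, keywords: list[str]) -> str:
    """Coarse labels: 'direct' if every key term of the erased fact
    appears in the answer, 'indirect' if only some appear, 'none'
    otherwise. The paper's third category, implied leakage, would need
    semantic judging and is deliberately out of scope for this sketch."""
    text = answer.lower()
    hits = [k for k in keywords if k.lower() in text]
    if len(hits) == len(keywords):
        return "direct"
    if hits:
        return "indirect"
    return "none"

def run_probe(query_model: Callable[[str], str],
              probe: Probe, keywords: list[str]) -> dict:
    """Black-box probe: query the unlearned model directly, then with a
    chain-of-thought framing, and label any re-emergence of the fact."""
    direct_answer = query_model(probe.direct)
    hop_answer = query_model("Let's think step by step. " + probe.multi_hop)
    return {
        "fact": probe.fact,
        "direct_query": classify_leakage(direct_answer, keywords),
        "multi_hop_query": classify_leakage(hop_answer, keywords),
    }
```

Under these assumptions, an aggregate leakage rate analogous to the abstract's headline figure would be the fraction of probes whose multi-hop label is not "none"; a symmetric run over retain-set queries would estimate collateral forgetting.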
Paper Type: Short
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling, Interpretability and Analysis of Models for NLP, Resources and Evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 10113