Keywords: Large Language Models, Machine Unlearning, Safety and Alignment, Reasoning-Based Leakage
Abstract: Large language models (LLMs) increasingly need mechanisms to remove specific information, motivated by privacy regulations, content-removal requests, and alignment with evolving norms. Unlearning methods aim to erase targeted knowledge while preserving overall utility, yet it is unclear whether these methods truly delete information or merely suppress it. We study a failure mode in which erased knowledge re-emerges under step-by-step prompting, a phenomenon we term \textit{reasoning-based leakage}. We introduce Sleek, a black-box diagnostic framework that probes unlearned models with multi-hop reasoning queries. Sleek synthesizes structured prompts, classifies leaked knowledge as \textit{direct}, \textit{indirect}, or \textit{implied}, and evaluates both incomplete forgetting of targeted facts and unintended suppression of retained knowledge. Across four representative unlearning techniques and two open-weight LLMs, Sleek reveals systematic leakage: erased facts are recoverable in up to 62.5\% of cases, and collateral forgetting occurs in 50\% of retained queries. Rather than proposing a new unlearning algorithm, this work offers an evaluation perspective that shows why suppression persists and how leakage arises through reasoning. Sleek provides a practical tool for comparing (model, unlearning method) configurations and highlights a broader challenge: reliable, verifiable forgetting remains unsolved for safety-critical LLM deployments.
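To make the described evaluation loop concrete, the following is a minimal sketch of a Sleek-style leakage probe. Everything in it is an illustrative assumption rather than the paper's actual implementation: the `query_model` callable, the `Probe` structure, and the keyword heuristic stand in for Sleek's prompt synthesis and response classification, and the paper's \textit{implied} category (leakage recoverable only by inference) would require a semantic judge that this sketch omits.

```python
# Minimal sketch of a reasoning-based leakage probe (illustrative only).
# Assumed names: query_model (a prompt -> completion callable), Probe, and
# the keyword heuristic below; none of these come from the paper itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    fact: str       # the fact the model was asked to unlearn
    direct: str     # direct recall query for that fact
    multi_hop: str  # step-by-step query that reaches the fact indirectly

def classify_leakage(answer: str, keywords: list[str]) -> str:
    """Coarse labels: 'direct' if every key term of the erased fact
    appears in the answer, 'indirect' if only some appear, 'none'
    otherwise. The paper's third category, implied leakage, would need
    semantic judging and is deliberately out of scope for this sketch."""
    text = answer.lower()
    hits = [k for k in keywords if k.lower() in text]
    if len(hits) == len(keywords):
        return "direct"
    if hits:
        return "indirect"
    return "none"

def run_probe(query_model: Callable[[str], str],
              probe: Probe, keywords: list[str]) -> dict:
    """Black-box probe: query the unlearned model directly, then with a
    chain-of-thought framing, and label any re-emergence of the fact."""
    direct_answer = query_model(probe.direct)
    hop_answer = query_model("Let's think step by step. " + probe.multi_hop)
    return {
        "fact": probe.fact,
        "direct_query": classify_leakage(direct_answer, keywords),
        "multi_hop_query": classify_leakage(hop_answer, keywords),
    }
```

Under these assumptions, an aggregate leakage rate analogous to the abstract's headline figure would be the fraction of probes whose multi-hop label is not "none"; a symmetric run over retain-set queries would estimate collateral forgetting.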
Paper Type: Short
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling, Interpretability and Analysis of Models for NLP, Resources and Evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 10113