Keywords: LLM Unlearning; LLM Alignment; In-context Learning; Unlearning Evaluation; Chain-of-Thought
TL;DR: We propose DRAGON, a lightweight black-box unlearning framework that leverages detection and chain-of-thought reasoning to enforce safe, in-context interventions without modifying the underlying LLM.
Abstract: Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Existing methods typically rely on fine-tuning and require access to retain data, which is often unavailable in real-world scenarios. To overcome these limitations, we propose \textbf{D}etect-\textbf{R}easoning \textbf{A}ugmented \textbf{G}enerati\textbf{ON} (\textbf{DRAGON}), a systematic, reasoning-based framework that applies in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. DRAGON identifies forget-worthy prompts using a lightweight detection module and routes them through a CoT guard model for safe intervention without modifying the base model or requiring retain data.
To robustly evaluate unlearning, we introduce novel performance metrics and a continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical data-limited scenarios.
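Conceptually, the detect-and-route pipeline described in the abstract can be sketched as below. This is a minimal illustration assuming generic callables for the detector, the CoT guard model, and the frozen base model; the names, the CoT instruction text, and the threshold heuristic are illustrative assumptions, not DRAGON's actual interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DetectRouteGuard:
    """Sketch of a black-box, inference-time unlearning guard.

    The base model is never modified; forget-worthy prompts are detected
    and routed through a chain-of-thought guard for safe intervention.
    """

    detector: Callable[[str], float]   # scores how forget-worthy a prompt is (hypothetical)
    guard_model: Callable[[str], str]  # CoT guard model that intervenes safely (hypothetical)
    base_model: Callable[[str], str]   # the unmodified deployed LLM (hypothetical)
    threshold: float = 0.5             # illustrative routing threshold, not from the paper

    def generate(self, prompt: str) -> str:
        # Lightweight detection step: decide whether the prompt touches
        # content that should have been unlearned.
        if self.detector(prompt) >= self.threshold:
            # Wrap the request in an in-context CoT instruction so the
            # guard reasons about what must not be revealed before answering.
            cot_prompt = (
                "The following request may involve unlearned content. "
                "Reason step by step about what must not be revealed, "
                f"then respond safely.\n\nRequest: {prompt}"
            )
            return self.guard_model(cot_prompt)
        # Benign prompts pass straight through to the base model,
        # so utility on retained knowledge is untouched.
        return self.base_model(prompt)
```

A natural consequence of this design is that no retain data or fine-tuning is needed: the guard operates purely in context, and swapping detectors or guard models requires no change to the deployed base model.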
Submission Number: 18