Keywords: LLM Unlearning; LLM Alignment; In-context Learning; Unlearning Evaluation; Chain-of-Thought
TL;DR: We propose DRAGON, a lightweight black-box unlearning framework that leverages detection and chain-of-thought reasoning to enforce safe, in-context interventions without modifying the underlying LLM.
Abstract: Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Existing methods typically rely on fine-tuning and require access to retain data, which is often unavailable in real-world scenarios. To overcome these limitations, we propose \textbf{D}etect-\textbf{R}easoning \textbf{A}ugmented \textbf{G}enerati\textbf{ON} (\textbf{DRAGON}), a systematic, reasoning-based framework that applies in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. DRAGON identifies forget-worthy prompts using a lightweight detection module and routes them through a CoT guard model for safe intervention without modifying the base model or requiring retain data.
To robustly evaluate unlearning, we introduce novel performance metrics and a continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical data-limited scenarios.
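Conceptually, the detect-and-route pipeline described in the abstract can be sketched as below. This is a minimal illustration assuming generic callables for the detector, the CoT guard model, and the frozen base model; the names, the CoT instruction text, and the threshold heuristic are illustrative assumptions, not DRAGON's actual interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DetectRouteGuard:
    """Sketch of a black-box, inference-time unlearning guard.

    The base model is never modified; forget-worthy prompts are detected
    and routed through a chain-of-thought guard for safe intervention.
    """

    detector: Callable[[str], float]   # scores how forget-worthy a prompt is (hypothetical)
    guard_model: Callable[[str], str]  # CoT guard model that intervenes safely (hypothetical)
    base_model: Callable[[str], str]   # the unmodified deployed LLM (hypothetical)
    threshold: float = 0.5             # illustrative routing threshold, not from the paper

    def generate(self, prompt: str) -> str:
        # Lightweight detection step: decide whether the prompt touches
        # content that should have been unlearned.
        if self.detector(prompt) >= self.threshold:
            # Wrap the request in an in-context CoT instruction so the
            # guard reasons about what must not be revealed before answering.
            cot_prompt = (
                "The following request may involve unlearned content. "
                "Reason step by step about what must not be revealed, "
                f"then respond safely.\n\nRequest: {prompt}"
            )
            return self.guard_model(cot_prompt)
        # Benign prompts pass straight through to the base model,
        # so utility on retained knowledge is untouched.
        return self.base_model(prompt)
```

A natural consequence of this design is that no retain data or fine-tuning is needed: the guard operates purely in context, and swapping detectors or guard models requires no change to the deployed base model.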
Submission Number: 18