Abstract: Recent advances in large reasoning models (LRMs) have enabled strong multi-step reasoning capabilities. However, existing machine unlearning algorithms are tailored to standard language modeling and fail to address the unique challenges posed by LRMs. In this work, we present the first systematic study of LRM unlearning and reveal that conventional unlearning methods often overlook critical information leakage in reasoning traces, even when final answers are successfully removed.
To address this, we propose **R**easoning-aware **R**epresentation **M**isdirection for **U**nlearning ($R^{2}$MU), a method that suppresses sensitive reasoning traces while preserving the model’s general reasoning ability. Our experiments demonstrate that $R^{2}$MU significantly reduces reasoning trace leakage and achieves strong performance across both reasoning and safety benchmarks, including WMDP, StrongReject, JBB-Behaviors, and WildJailbreak, on state-of-the-art models such as DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B.
To the best of our knowledge, $R^{2}$MU is the first principled approach to exposing and mitigating reasoning trace leakage in LRM unlearning while preserving reasoning ability.
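As context for the method name, a minimal sketch of what a generic representation-misdirection unlearning objective looks like is given below. This is an illustrative assumption on our part, not the paper's $R^{2}$MU formulation: the abstract only states that reasoning traces are suppressed in addition to final answers, so all symbols (`h_forget`, `h_retain`, `control_vec`, `alpha`) are hypothetical placeholders for layer activations and hyperparameters.

```python
# Hypothetical sketch: a generic representation-misdirection unlearning loss.
# Forget-set activations (here assumed to cover reasoning-trace tokens as well
# as answer tokens) are pushed toward a fixed random control vector, while
# retain-set activations are kept close to a frozen reference model's.
import torch
import torch.nn.functional as F

def misdirection_loss(h_forget, h_retain, h_retain_ref, control_vec, alpha=1.0):
    # Push forget activations toward the control vector.
    forget_term = F.mse_loss(h_forget, control_vec.expand_as(h_forget))
    # Keep retain activations close to the reference model's activations.
    retain_term = F.mse_loss(h_retain, h_retain_ref)
    return forget_term + alpha * retain_term

# Toy usage with random tensors standing in for hidden states of one layer.
d = 16
h_f = torch.randn(4, d, requires_grad=True)      # forget-set activations
h_r = torch.randn(4, d, requires_grad=True)      # retain-set activations
h_r_ref = torch.randn(4, d)                      # reference (frozen) activations
c = torch.randn(d)                               # random control vector
loss = misdirection_loss(h_f, h_r, h_r_ref, c)
loss.backward()
```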
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 6350