SelfCAD: Protecting Your Efficient Reasoning Capabilities via Self Cautious Insertion

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Anti-distillation, Reasoning model, Copyright protection
Abstract: Large reasoning models (LRMs) are increasingly deployed in modern AI systems due to their accuracy, efficiency, and transparency, as their reasoning traces enable users and auditors to interpret model outputs. However, publishing these traces introduces new risks. Adversaries may distill them to replicate efficient LRMs for their own purposes or to build proxy models for malicious attacks, raising copyright and security concerns that threaten the sustainability of the LLM ecosystem. Existing defenses mainly detect distillation after violations occur or suppress transparency by masking or rewriting reasoning traces, both of which are impractical in real-world deployments. In this work, we propose a defense framework that preserves reasoning traces while preventing effective distillation. We begin with a systematic analysis of how different reasoning components affect model efficiency and accuracy. Our results reveal that the number of self-cautious sentences plays a crucial role: excessive self-cautious sentences lead to redundant outputs, while insufficient ones harm accuracy. Building on this insight, we propose $\textbf{SelfCAD (Self-Cautious Anti-Distillation)}$, a lightweight anti-distillation method that strategically manipulates the self-cautious parts of reasoning traces after generation. SelfCAD preserves the semantic clarity of reasoning traces for human users and LLM auditors, yet significantly degrades the efficiency and accuracy of downstream distilled models. Experiments on Llama and Qwen show that distilled models incur higher inference cost and lower accuracy; for Qwen-1.5B in particular, output length on GSM8K is $4.8\times$ longer when distilled from our processed responses than from vanilla responses. These results highlight a new, efficiency-based perspective on safeguarding reasoning models from distillation while preserving interpretability.
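To make the post-generation manipulation concrete, the sketch below illustrates one plausible way such a step could look. It is not the authors' SelfCAD implementation: the cue phrases, templates, insertion count, and random placement are all assumptions made here for illustration; the paper states only that self-cautious parts are manipulated strategically after the reasoning trace is generated.

```python
# Illustrative sketch only -- NOT the authors' SelfCAD method.
# Assumes a simple cue-phrase notion of "self-cautious" sentences and
# shows how a finished reasoning trace could be post-processed to add them.
import random
import re

# Hypothetical cue phrases and templates; the paper does not specify these.
CAUTIOUS_CUES = ("wait", "let me double-check", "hold on", "let me verify")
CAUTIOUS_TEMPLATES = [
    "Wait, let me double-check that step before moving on.",
    "Hold on, I should verify this intermediate result.",
]


def is_self_cautious(sentence: str) -> bool:
    """Heuristically flag a sentence as self-cautious via cue phrases."""
    s = sentence.strip().lower()
    return any(cue in s for cue in CAUTIOUS_CUES)


def insert_self_cautious(trace: str, extra: int = 2, seed: int = 0) -> str:
    """Insert extra self-cautious sentences into a completed reasoning trace.

    The number and positions of insertions are placeholders; the actual
    method chooses manipulations strategically, which this sketch does not model.
    """
    rng = random.Random(seed)
    # Naive sentence split; a real pipeline would use a proper sentence tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", trace.strip())
    positions = rng.sample(range(1, len(sentences) + 1),
                           k=min(extra, len(sentences)))
    for offset, pos in enumerate(sorted(positions)):
        sentences.insert(pos + offset, rng.choice(CAUTIOUS_TEMPLATES))
    return " ".join(sentences)


if __name__ == "__main__":
    trace = ("We need 3 apples per pie and have 12 apples. "
             "12 divided by 3 is 4. So we can bake 4 pies.")
    print(insert_self_cautious(trace))
```

The trace stays readable for a human auditor, while a model distilled on many such padded traces would be nudged toward redundant self-checking, which is the efficiency degradation the abstract describes.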
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 11654