Masked Distillation: Internalizing Chain-of-Thought in Small Language Models

Published: 26 May 2026, Last Modified: 26 May 2026 | ICML 2026 FoGen Workshop Poster | CC BY 4.0
Keywords: Knowledge distillation, Large reasoning models, Inference cost
Abstract: Large reasoning models (LRMs) produce long, explicit chains of intermediate steps before generating a final answer at inference time. These intermediate traces dominate latency, memory usage, and serving cost, even though their length is not a reliable indicator of the true computational complexity of the problem instance. This raises two natural questions. First, can the computation expressed in these intermediate tokens be internalized into the parameters of a smaller student model through knowledge distillation, so that the student produces answers directly or with much shorter intermediate traces? Second, how does such internalization affect performance on out-of-distribution problems? We investigate these questions through controlled distillation experiments, transferring knowledge from a Qwen3-4B thinking teacher model to a Qwen2.5-0.5B-Instruct student model across two reasoning domains: GSM8K (grade-school arithmetic) and Countdown (a number-puzzle search task). We vary two key design dimensions. The first is the amount of intermediate scaffolding provided to the student during training: a non-masked regime, where the student is trained on the teacher’s full thinking-plus-solution trace, and a masked regime, where the student is trained to predict the solution from the prompt along with a fixed budget of teaching tokens $k \in \{0,100,1000\}$ sampled from the teacher’s trace. The second dimension is the training objective: on-policy distillation with a reverse-KL loss, where the student generates responses and is trained to match the teacher’s response distribution, versus off-policy supervised fine-tuning (SFT) on teacher-generated rollouts.
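The two design dimensions described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`build_masked_example`, `reverse_kl`, `sft_nll`) are hypothetical, and the choice to keep the first $k$ trace tokens is an assumption, since the abstract only states that the teaching tokens are sampled from the teacher's trace.

```python
import math

def build_masked_example(prompt, trace, solution, k):
    """Hypothetical helper for the masked regime.

    Keeps at most k tokens of the teacher's trace as teaching tokens.
    Assumption: the first k tokens are kept; the abstract does not say
    how the k tokens are selected.
    Returns (context, targets): the loss would cover only the targets.
    """
    teaching_tokens = trace[:k]
    return prompt + teaching_tokens, solution

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for a single next-token distribution.

    In on-policy distillation this term would be evaluated at positions
    of a response generated by the student itself.
    """
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def sft_nll(student_probs, target_index):
    """Negative log-likelihood of one teacher-generated token (off-policy SFT)."""
    return -math.log(student_probs[target_index])

# Toy usage with made-up tokens and probabilities.
context, targets = build_masked_example(
    ["Q"], ["t1", "t2", "t3"], ["ans"], k=2
)
student = [0.5, 0.5]
teacher = [0.25, 0.75]
print(context, targets)
print(reverse_kl(student, teacher), sft_nll(student, 0))
```

With $k = 0$ the context reduces to the prompt alone, which corresponds to the student predicting the solution with no intermediate scaffolding.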
Submission Number: 178