Making Complex Reasoning Student-Friendly: A Hybrid LLM-to-SLM Distillation Framework

Published: 03 Mar 2026, Last Modified: 01 Apr 2026, SPOT, CC BY 4.0
Keywords: Distillation, Small Language Models, On-Policy Distillation, Prefix, Hierarchical Reasoning
TL;DR: We propose a framework that bridges off-policy and on-policy distillation by dynamically handing off generation from teacher to student based on student confidence, creating hybrid reasoning traces that are both high-quality and easy to learn.
Abstract: Distilling the reasoning capabilities of large language models into smaller ones remains challenging. Off-policy distillation imitates fixed teacher trajectories and often suffers from teacher–student misalignment, leading to superficial pattern memorization. In contrast, on-policy distillation improves alignment by relying on student-generated solutions with teacher feedback, but struggles with exploration and frequently fails on complex problems when the student cannot generate valid solutions from scratch. To bridge this gap, we propose Student-Friendly Distillation (SFD), a framework that synergizes off-policy teacher trajectories with on-policy student generations. Specifically, SFD performs hybrid teacher-student generation where, starting from a teacher prefix, it switches to on-policy student generation once the student is sufficiently confident and aligned. The timing of this transition is governed by a dynamic hand-off criterion based on the student’s token entropy and its negative log-likelihood of the teacher’s reasoning tokens. After completion, the student rewrites the full solution to retain the teacher’s high-level reasoning in its own style, yielding trajectories that preserve quality while remaining in-distribution. Extensive experiments across six reasoning benchmarks demonstrate that SFD consistently outperforms both off-policy distillation and the on-policy rejection fine-tuning method, without access to teacher log probabilities, which are costly to compute.
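The hand-off criterion described in the abstract can be illustrated with a small sketch. The abstract states only that the criterion depends on the student's token entropy and its negative log-likelihood (NLL) of the teacher's reasoning tokens; everything else below is an assumption made for illustration, including the function names, the fixed thresholds `max_entropy` and `max_nll`, the averaging `window`, and the rule that both averages must fall below their thresholds at the same time. The sketch uses plain probability lists in place of real model outputs.

```python
import math
from typing import List, Optional, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the student's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def token_nll(probs: Sequence[float], teacher_token: int) -> float:
    """Student's negative log-likelihood of the token the teacher actually wrote."""
    return -math.log(max(probs[teacher_token], 1e-12))


def find_handoff(
    student_probs: List[Sequence[float]],
    teacher_tokens: List[int],
    max_entropy: float = 0.5,  # hypothetical threshold, not from the paper
    max_nll: float = 0.5,      # hypothetical threshold, not from the paper
    window: int = 2,           # hypothetical smoothing window
) -> Optional[int]:
    """Return the first teacher-prefix length at which the student looks both
    confident (low mean entropy) and aligned (low mean NLL of teacher tokens)
    over the last `window` positions, or None if that never happens.
    """
    if len(student_probs) != len(teacher_tokens):
        raise ValueError("need one student distribution per teacher token")

    entropies = [token_entropy(p) for p in student_probs]
    nlls = [token_nll(p, t) for p, t in zip(student_probs, teacher_tokens)]

    for end in range(window, len(teacher_tokens) + 1):
        mean_entropy = sum(entropies[end - window:end]) / window
        mean_nll = sum(nlls[end - window:end]) / window
        if mean_entropy <= max_entropy and mean_nll <= max_nll:
            # Keep teacher tokens [0, end); the student generates from here on.
            return end
    return None


# Toy example over a 3-token vocabulary: the student is unsure on the first
# two teacher tokens and confident and aligned on the rest.
toy_probs = [
    [0.40, 0.30, 0.30],
    [0.34, 0.33, 0.33],
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
]
toy_teacher = [0, 1, 0, 1, 2]
handoff = find_handoff(toy_probs, toy_teacher)
```

In this toy setup the teacher prefix would cover the positions before `handoff` and the student would continue on-policy from there. Checking NLL alongside entropy matters because a student can be confident yet disagree with the teacher; in that case the entropy is low but the NLL stays high, and no hand-off is triggered.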
Submission Number: 13