Keywords: On-Policy Distillation; Capacity Gap
Abstract: On-policy distillation (OPD) is increasingly adopted in modern
post-training pipelines as a remedy for the exposure bias and
catastrophic forgetting of supervised fine-tuning. Yet stronger
teachers still frequently fail to improve, and sometimes actively
degrade, the student under OPD. We dissect OPD's training signal from
three perspectives to explain \emph{what} it learns,
\emph{where} its gains come from, and \emph{why} it sometimes hurts.
\emph{(i) What:} OPD transfers the teacher's \emph{uncertainty profile}
rather than its problem-level knowledge. \emph{(ii) Where:} gains come from aligning the student's
per-position \emph{entropy shape} with the teacher's, so already
shape-aligned pairings have no headroom to gain. \emph{(iii) Why:}
regression under OPD occurs when the teacher pulls the student off
confidently correct tokens, which triggers catastrophic forgetting
during training. Together, these findings give a unified mechanistic
account of when and why OPD helps or hurts, and yield probe methods
that predict the likely outcome of OPD before training begins.
Paper Type: Long
Research Area: Efficient Methods for NLP
Research Area Keywords: distillation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 17374