Rethinking the Capacity Gap in On-Policy Distillation for Large Language Models

ACL ARR 2026 May Submission 17374 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0
Keywords: On Policy Distillation; Capacity Gap
Abstract: On-policy distillation (OPD) is increasingly adopted in modern post-training pipelines as a remedy for the exposure bias and catastrophic forgetting of supervised fine-tuning. Yet stronger teachers still frequently fail to improve, and sometimes actively degrade, the student under OPD. We dissect OPD's training signal from three perspectives to explain \emph{what} it learns, \emph{where} its gains come from, and \emph{why} it sometimes hurts. \emph{(i) What:} OPD transfers the teacher's \emph{uncertainty profile} rather than its problem-level knowledge. \emph{(ii) Where:} gains come from aligning the student's per-position \emph{entropy shape} with the teacher's, so pairings that are already shape-aligned have no headroom to gain. \emph{(iii) Why:} regression under OPD is caused by the teacher pulling the student off confidently correct tokens, which triggers catastrophic forgetting during training. Together, these findings give a unified mechanistic account of when and why OPD helps or hurts, and they yield probe methods that predict the likely outcome of OPD before training begins.
Paper Type: Long
Research Area: Efficient Methods for NLP
Research Area Keywords: distillation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 17374