Keywords: on-policy distillation, language model
Abstract: On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, because teacher supervision on student-generated tokens may yield unreliable policy gradients and can even cause optimization failure. This work studies how to make on-policy token-level supervision reliable through credit assignment strategies, and proposes Trust-Region On-Policy Distillation (TrOPD), which has three components:
1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch.
2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision.
3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions.
Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
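As a rough illustration of the first component, the sketch below computes the per-token K1 reverse-KL estimate (student log-probability minus teacher log-probability on student-sampled tokens) and averages it only over tokens inside a trust region. The abstract does not state how TrOPD defines its trust region, so the absolute log-ratio threshold `delta` and the helper names here are assumptions made for illustration, not the paper's method.

```python
def k1_reverse_kl(student_logps, teacher_logps):
    """Per-token K1 estimates of KL(student || teacher).

    Each input holds log-probabilities of the tokens the student sampled,
    scored by the student and by the teacher respectively.
    """
    return [s - t for s, t in zip(student_logps, teacher_logps)]


def trust_region_mask(k1_values, delta):
    """Hypothetical trust-region rule: keep tokens whose |log-ratio| <= delta."""
    return [abs(v) <= delta for v in k1_values]


def masked_mean(values, mask):
    """Average the values whose mask entry is True (0.0 if none are kept)."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Toy example: the third token is one the teacher considers very unlikely,
# so its K1 estimate is large and it falls outside the assumed trust region.
student_logps = [-0.5, -1.0, -0.2, -0.3]
teacher_logps = [-0.6, -1.2, -9.0, -0.4]

k1 = k1_reverse_kl(student_logps, teacher_logps)
mask = trust_region_mask(k1, delta=2.0)
loss = masked_mean(k1, mask)
```

Masking is only one of the outlier treatments the abstract lists; clipping the log-ratio or switching to a forward-KL term on the masked tokens would replace the `masked_mean` step.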
Paper Type: Long
Research Area: Efficient Methods for NLP
Research Area Keywords: distillation
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: no
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Visa Needs: yes
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: This paper presents work whose goal is to advance the field of Machine Learning. There may be potential societal consequences of our work, none of which we feel must be specifically highlighted here.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B2 Discuss The License For Artifacts: No
B2 Elaboration: We only use publicly available open-source artifacts under their respective licenses and terms of use, and do not redistribute any new artifacts.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: We only use publicly available open-source artifacts under their respective licenses and terms of use, and do not redistribute any new artifacts.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 5.1
B6 Statistics For Data: No
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 5.1
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 5.1
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 15475