Trust Region On-Policy Distillation

ACL ARR 2026 May Submission15475 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: on-policy distillation, language model

Abstract: On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, because teacher supervision on student-generated tokens can yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies and proposes Trust Region On-Policy Distillation (TrOPD). TrOPD has three components: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: for outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: the student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
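To make the trust-region idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes the per-token K1 estimate of the reverse KL on student-sampled tokens, log pi_student(x) - log pi_teacher(x), and applies the masking option for outlier tokens. The abstract does not say how the trust region is defined, so the rule used here (an absolute log-probability gap of at most `max_gap`) is an assumption, as are the function names and the threshold value.

```python
def k1_reverse_kl(student_logps, teacher_logps):
    """Per-token K1 estimate of KL(student || teacher).

    Inputs are log-probabilities that each model assigns to the tokens
    the student actually sampled (on-policy tokens).
    """
    return [s - t for s, t in zip(student_logps, teacher_logps)]


def trust_region_mask(student_logps, teacher_logps, max_gap=2.0):
    """Hypothetical trust-region rule (an assumption, not from the paper):
    a token is trusted when |log pi_s - log pi_t| <= max_gap."""
    return [abs(s - t) <= max_gap for s, t in zip(student_logps, teacher_logps)]


def masked_k1_loss(student_logps, teacher_logps, max_gap=2.0):
    """Average the K1 estimate over trusted tokens only.

    Tokens outside the trust region are masked out, which is one of the
    outlier treatments the abstract lists. Returns 0.0 if no token is trusted.
    """
    k1 = k1_reverse_kl(student_logps, teacher_logps)
    mask = trust_region_mask(student_logps, teacher_logps, max_gap)
    kept = [value for value, trusted in zip(k1, mask) if trusted]
    return sum(kept) / len(kept) if kept else 0.0


# Toy example: the second token has a large teacher/student gap and is masked.
student = [-0.5, -1.0, -0.2]
teacher = [-0.7, -6.0, -0.1]
loss = masked_k1_loss(student, teacher)
```

In this toy input the second token has a gap of 5.0 nats, so it is excluded and the loss averages only the two remaining tokens. A real training loop would compute these quantities on tensors and backpropagate through the student log-probabilities; the clipping and forward-KL variants mentioned in the abstract are not shown.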

Paper Type: Long

Research Area: Efficient Methods for NLP

Research Area Keywords: distillation

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: no

Reassignment Request Area Chair: This is not a resubmission

Reassignment Request Reviewers: This is not a resubmission

Visa Needs: yes

A1 Limitations Section: This paper has a limitations section.

A2 Potential Risks: No

A2 Elaboration: This paper presents work whose goal is to advance the field of Machine Learning. There may be potential societal consequences of our work, none of which we feel must be specifically highlighted here.

B Use Or Create Scientific Artifacts: Yes

B1 Cite Creators Of Artifacts: Yes

B2 Discuss The License For Artifacts: No

B2 Elaboration: We only use publicly available open-source artifacts under their respective licenses and terms of use, and do not redistribute any new artifacts.

B3 Artifact Use Consistent With Intended Use: No

B3 Elaboration: We only use publicly available open-source artifacts under their respective licenses and terms of use, and do not redistribute any new artifacts.

B4 Data Contains Personally Identifying Info Or Offensive Content: No

B5 Documentation Of Artifacts: Yes

B5 Elaboration: Section 5.1

B6 Statistics For Data: No

C Computational Experiments: Yes

C1 Model Size And Budget: Yes

C1 Elaboration: Section 5.1

C2 Experimental Setup And Hyperparameters: Yes

C2 Elaboration: Section 5.1

C3 Descriptive Statistics: Yes

C3 Elaboration: Section 5

C4 Parameters For Packages: N/A

D Human Subjects Including Annotators: No

D1 Instructions Given To Participants: N/A

D2 Recruitment And Payment: N/A

D3 Data Consent: N/A

D4 Ethics Review Board Approval: N/A

E AI Assistants In Research Or Writing: No

E1 Information About Use Of AI Assistants: N/A

Author Submission Checklist: yes

Submission Number: 15475
