IOD: An Iterative On-policy Distillation Framework for Self-Improving Language Models

ACL ARR 2026 May Submission16687 Authors

26 May 2026 (modified: 16 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: Knowledge Distillation, Large Language Models, On-policy Distillation, Iterative Learning, Sequence Generation

Abstract: Knowledge distillation is an effective approach for compressing large language models, but distillation for sequence generation suffers from a train-test mismatch caused by teacher forcing. On-policy distillation addresses this by training on student-generated trajectories, yet it typically lacks an explicit mechanism for selecting and reusing useful trajectories across training cycles. We propose Iterative On-policy Distillation (IOD), a multi-cycle framework that constructs a teacher-filtered curriculum from student generations. At each cycle, the student generates candidates, a fixed teacher scores them by conditional likelihood, and high-scoring samples are retained, accumulated, and re-filtered for subsequent distillation. This enables quality-controlled reuse of student-generated data without an external reward model. Experiments on summarization and machine translation show that IOD improves over supervised fine-tuning and standard on-policy distillation, with ablations confirming the importance of filtering and accumulation.
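The cycle described in the abstract (generate, score, retain, accumulate, re-filter) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `filter_top`, `iod_cycle`, `run_iod`, `toy_generate`, and `toy_teacher_score`, the top-fraction filtering rule, and the `keep_ratio` and `num_candidates` parameters are all assumptions made for illustration. The toy scorer only stands in for a teacher's conditional log-likelihood, and the distillation update itself is left as a comment.

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (source text, student-generated output)


def filter_top(scored: Dict[Sample, float], keep_ratio: float) -> Dict[Sample, float]:
    """Keep the highest-scoring fraction of samples (at least one if any exist)."""
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return dict(ranked[:keep])


def iod_cycle(
    prompts: List[str],
    generate: Callable[[str], str],
    teacher_score: Callable[[str, str], float],
    pool: Dict[Sample, float],
    num_candidates: int = 4,   # assumed hyperparameter
    keep_ratio: float = 0.5,   # assumed filtering rule: keep the top fraction
) -> Dict[Sample, float]:
    """One hypothetical cycle: generate, score, retain, accumulate, re-filter."""
    # 1. The student generates candidates; the fixed teacher scores each one.
    fresh: Dict[Sample, float] = {}
    for x in prompts:
        for _ in range(num_candidates):
            y = generate(x)
            fresh[(x, y)] = teacher_score(x, y)
    # 2. Retain high-scoring fresh samples and merge them into the pool.
    merged = {**pool, **filter_top(fresh, keep_ratio)}
    # 3. Re-filter the accumulated pool before the next distillation step.
    return filter_top(merged, keep_ratio)


def toy_generate(x: str, rng: random.Random) -> str:
    """Toy student: emits a random-length prefix of the source."""
    words = x.split()
    return " ".join(words[: rng.randint(1, len(words))])


def toy_teacher_score(x: str, y: str) -> float:
    """Toy stand-in for the teacher's log p(y | x): fuller prefixes score higher."""
    return float(len(y.split()) - len(x.split()))


def run_iod(prompts: List[str], num_cycles: int = 3, seed: int = 0) -> Dict[Sample, float]:
    rng = random.Random(seed)
    pool: Dict[Sample, float] = {}
    for _ in range(num_cycles):
        pool = iod_cycle(prompts, lambda x: toy_generate(x, rng), toy_teacher_score, pool)
        # A real system would distill the student on `pool` here, so that the
        # next cycle samples from the updated student.
    return pool
```

Calling `run_iod(["a b c d", "e f g"])` returns a dictionary mapping the retained (source, output) pairs to their toy scores. In practice the scorer would be replaced by the teacher's conditional likelihood of the student output given the source, and the filtering rule by whatever criterion the paper specifies.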

Paper Type: Long

Research Area: Machine Learning for NLP

Research Area Keywords: distillation, data augmentation, generative models, text-to-text generation

Contribution Types: Approaches to low-compute settings (efficiency)

Languages Studied: English, German

EMNLP 2026 AI Reviewing Experiment: yes

Submission Number: 16687
