IOD: An Iterative On-policy Distillation Framework for Self-Improving Language Models

ACL ARR 2026 May Submission16687 Authors

26 May 2026 (modified: 16 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0
Keywords: Knowledge Distillation, Large Language Models, On-policy Distillation, Iterative Learning, Sequence Generation
Abstract: Knowledge distillation is an effective approach for compressing large language models, but sequence generation distillation suffers from train-test mismatch caused by teacher forcing. On-policy distillation addresses this by training on student-generated trajectories, yet it typically lacks an explicit mechanism for selecting and reusing useful trajectories across training cycles. We propose Iterative On-policy Distillation (IOD), a multi-cycle framework that constructs a teacher-filtered curriculum from student generations. At each cycle, the student generates candidates, a fixed teacher scores them by conditional likelihood, and high-scoring samples are retained, accumulated, and re-filtered for subsequent distillation. This enables quality-controlled reuse of student-generated data without an external reward model. Experiments on summarization and machine translation show that IOD improves over supervised fine-tuning and standard on-policy distillation, with ablations confirming the importance of filtering and accumulation.
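The cycle described in the abstract (generate, score with a fixed teacher, retain, accumulate, re-filter, distill) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function names, the `generate`, `teacher_score`, and `train` callables, the top-fraction selection rule, and the `keep_ratio` and `pool_size` parameters are all assumptions introduced here, since the abstract does not specify how "high-scoring" is decided or how the accumulated set is bounded.

```python
from typing import Callable, List, Tuple

# A sample pairs a source input with one student-generated output.
Sample = Tuple[str, str]


def top_k(samples: List[Sample],
          score: Callable[[str, str], float],
          k: int) -> List[Sample]:
    """Return the k samples with the highest teacher score (stable on ties)."""
    ranked = sorted(samples, key=lambda s: score(s[0], s[1]), reverse=True)
    return ranked[:k]


def run_iod(sources: List[str],
            generate: Callable[[str, int], List[str]],
            teacher_score: Callable[[str, str], float],
            train: Callable[[List[Sample]], None],
            num_cycles: int = 3,
            keep_ratio: float = 0.5,
            pool_size: int = 1000) -> List[Sample]:
    """Hypothetical multi-cycle loop; every parameter here is an assumption."""
    pool: List[Sample] = []
    for cycle in range(num_cycles):
        # 1. The student generates candidates for every source input.
        candidates = [(x, y) for x in sources for y in generate(x, cycle)]

        # 2. The fixed teacher scores candidates; keep the top fraction.
        #    (Assumed rule: the abstract only says "high-scoring".)
        keep = max(1, int(len(candidates) * keep_ratio)) if candidates else 0
        retained = top_k(candidates, teacher_score, keep)

        # 3. Accumulate across cycles, dropping exact duplicates in order.
        pool = list(dict.fromkeys(pool + retained))

        # 4. Re-filter the accumulated pool with the same teacher score.
        pool = top_k(pool, teacher_score, pool_size)

        # 5. Distill: update the student on the filtered pool.
        train(pool)
    return pool
```

In practice `teacher_score(x, y)` would stand for the teacher's conditional log-likelihood of `y` given `x`, and `train` for a distillation update of the student; here they are left as plain callables so that the control flow can be read independently of any particular model library.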
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: distillation, data augmentation, generative models, text-to-text generation
Contribution Types: Approaches to low-compute settings (efficiency)
Languages Studied: English, German
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 16687