Towards Understanding the Benefits of Online Imitation Learning

Published: 03 Mar 2026, Last Modified: 01 Apr 2026 · SPOT · CC BY 4.0
Keywords: imitation learning, post-training, misspecification, large language models
Abstract: Online imitation learning (IL), particularly on-policy distillation with a reverse-KL objective, has emerged as a strong approach for LLM post-training, often outperforming offline baselines such as supervised fine-tuning (SFT). However, a principled understanding of when and why online interaction helps is still lacking. In this work, we show that the benefits of online interaction depend critically on whether the setting is realizable, i.e., whether the student policy class can represent the expert policy. Under realizability, we empirically show that offline IL already matches the expert’s performance on Countdown and reasoning tasks, challenging the common explanation that online IL benefits from mitigating error accumulation. In contrast, in non-realizable (misspecified) settings, we prove that offline IL encounters an information-theoretic bottleneck. Moreover, under severe misspecification, where the distributional discrepancy between the expert and any student distribution is large, existing analyses are insufficient to explain the effectiveness of online IL. To address this gap, we introduce a structural characterization of misspecification relative to the reward, under which online IL provably achieves high performance despite a large expert–student discrepancy.
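For concreteness, the reverse-KL on-policy distillation objective mentioned in the abstract is typically written as below. This is a minimal sketch of the standard formulation, not a statement of this paper's exact setup; the notation (student policy $\pi_\theta$, expert policy $\pi_E$, prompt distribution $\mathcal{D}$) is assumed here:

$$\min_{\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_E(\cdot \mid x)\big)\Big] \;=\; \min_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\log \pi_\theta(y \mid x) - \log \pi_E(y \mid x)\big].$$

Because the responses $y$ are sampled from the current student $\pi_\theta$, optimizing this objective requires online interaction; by contrast, offline SFT minimizes the forward direction on expert data, $\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_E(\cdot \mid x)}\big[-\log \pi_\theta(y \mid x)\big]$.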
Submission Number: 61