TMS: Trajectory-Mixed Supervision for On-Policy Self Distillation

Rana Shahroz; Zijie Liu; Zhen Tan; Charles Fleming; Tianlong Chen

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, Readers: Everyone, License: CC BY 4.0

Abstract: Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to $\textbf{Supervision Mismatch}$: the divergence between the model's evolving policy and static training labels. We address this trade-off with $\textbf{Trajectory-Mixed Supervision (TMS)}$, a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model's own historical checkpoints. TMS minimizes $\textit{Policy-Label Divergence (PLD)}$, preventing the mode collapse that drives forgetting in standard SFT. Experiments across reasoning (MATH, GSM8K) and instruction-following benchmarks demonstrate that TMS effectively shifts the accuracy-retention Pareto frontier. While RL remains the gold standard for retention, TMS significantly outperforms standard and iterative SFT, bridging the gap to RL without requiring reward models or verifiers. Mechanistic analysis confirms that PLD drift accurately predicts forgetting, and that TMS successfully mitigates this drift.

Lay Summary: Large language models are often improved by training them on a specific task, such as solving math problems or following instructions. A common training method is simple and cheap, but it can make the model forget skills it already had, like general reasoning, question answering, or safety-related behavior. More robust reinforcement learning methods can reduce this forgetting, but they usually require reward signals, verifiers, and expensive repeated sampling. We study why this trade-off happens and find that one cause is a mismatch between a changing model and fixed training answers. As the model learns, it may discover several reasonable ways to answer, while standard training keeps forcing it toward one static example. Our method, Trajectory-Mixed Supervision, reuses answers generated by saved versions of the same model during training and mixes them with the original answers. This gives the model several plausible ways to learn the task, closer to its own learning path, without requiring rewards. Across math, reasoning, and instruction-following benchmarks, this preserves much more of the model’s previous ability while keeping strong gains on the target task.
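The data-mixing idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the mixing rule, so the function `build_mixed_dataset`, the `per_prompt` parameter, uniform sampling over checkpoints, and the representation of a checkpoint as a plain prompt-to-answer callable are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a saved checkpoint: a function from prompt to answer.
Checkpoint = Callable[[str], str]


def build_mixed_dataset(
    prompts: Sequence[str],
    gold_answers: Sequence[str],
    checkpoints: Sequence[Checkpoint],
    per_prompt: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Return (prompt, answer, source) triples mixing gold and checkpoint answers."""
    if len(prompts) != len(gold_answers):
        raise ValueError("prompts and gold_answers must have the same length")
    rng = random.Random(seed)
    k = min(per_prompt, len(checkpoints))
    mixed: List[Tuple[str, str, str]] = []
    for prompt, gold in zip(prompts, gold_answers):
        # Always keep the original static label.
        mixed.append((prompt, gold, "gold"))
        # Assumed rule: add answers from k distinct, uniformly sampled checkpoints.
        for idx in rng.sample(range(len(checkpoints)), k):
            mixed.append((prompt, checkpoints[idx](prompt), f"ckpt{idx}"))
    # Shuffle so gold and checkpoint-generated targets are interleaved in training.
    rng.shuffle(mixed)
    return mixed


# Toy usage with three fake "checkpoints" that just echo their index.
toy_checkpoints = [
    (lambda p, i=i: f"answer to {p!r} from checkpoint {i}") for i in range(3)
]
data = build_mixed_dataset(["2+2?", "3*3?"], ["4", "9"], toy_checkpoints, per_prompt=2)
```

In a real setting each callable would wrap generation from a saved model checkpoint, and the resulting triples would feed an ordinary supervised fine-tuning loop; whether generated answers are filtered or weighted is left open here because the source does not say.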

Primary Area: Deep Learning->Everything Else

Keywords: Fine-tuning, SFT, RLFT, Data-centric

Submission Number: 18540
