Off-Policy Token Clipped Supervised Fine-Tuning Yields a Robust Cold-Start

ICLR 2026 Conference Submission 19335 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Supervised Fine-Tuning, Reinforcement Learning
TL;DR: We introduce a novel supervised fine-tuning strategy inspired by trust region methods in RL, which enables a more stable, on-policy-like training dynamic and preserves pre-existing knowledge for subsequent RL.
Abstract: Supervised Fine-Tuning (SFT) is a critical step for adapting Large Language Models (LLMs) to specialized domains, often serving as a cold start for subsequent reinforcement learning (RL). However, SFT's tendency to memorize a small set of expert data for a downstream task can impair generalization and lead to catastrophic forgetting of prior knowledge, undermining the promise of effective RL. In this paper, we demonstrate that this degradation primarily results from tokens in the expert data to which the base model assigns low probability. Specifically, we frame these as 'off-policy' tokens, as they represent a significant deviation from the model's prior knowledge. Due to the nature of the log-likelihood objective, these off-policy tokens produce larger gradient magnitudes, destabilizing the training process. To investigate this phenomenon, we adopt a well-established clipping strategy from reinforcement learning, which is widely used to manage off-policy data in an on-policy manner. Applying this strategy to SFT moderates the learning process by constraining gradient updates from off-policy tokens, creating a more on-policy-like training dynamic. Through extensive experiments on the agentic benchmarks ALFWorld and ScienceWorld, we find that, compared to standard SFT, this clipped approach reduces forgetting on out-of-distribution tasks by 11.54% and boosts final RL performance by 6.70%. Furthermore, latent-space analysis validates our initial claim, showing that the off-policy token clipping strategy induces less drift in the model's internal representations than standard SFT, which is key to preserving prior knowledge.
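The abstract does not spell out the loss, but the mechanism it describes (a PPO-style trust-region clip applied per token, so that tokens far from the model's prior stop contributing large gradients) can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the function name `clipped_sft_loss`, the parameter `epsilon`, the choice of the frozen base model as the reference policy, and the implicit advantage of +1 for expert tokens are all assumptions.

```python
# Hypothetical sketch of a token-level clipped SFT loss (PyTorch).
# Assumptions (not taken from the paper): the importance ratio is computed
# against the frozen base model, expert tokens carry an implicit advantage
# of +1, and the clip follows the PPO trust-region form.
import torch
import torch.nn.functional as F


def clipped_sft_loss(policy_logits: torch.Tensor,
                     base_logits: torch.Tensor,
                     target_ids: torch.Tensor,
                     epsilon: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss over expert target tokens.

    policy_logits, base_logits: (batch, seq_len, vocab)
    target_ids:                 (batch, seq_len) expert token ids
    """
    # Log-probabilities of the expert tokens under the current policy.
    policy_logp = torch.gather(
        F.log_softmax(policy_logits, dim=-1), -1, target_ids.unsqueeze(-1)
    ).squeeze(-1)
    # Log-probabilities under the frozen base (reference) model.
    with torch.no_grad():
        base_logp = torch.gather(
            F.log_softmax(base_logits, dim=-1), -1, target_ids.unsqueeze(-1)
        ).squeeze(-1)

    # Importance ratio of the current policy relative to the base model.
    ratio = torch.exp(policy_logp - base_logp)

    # PPO-style clipped surrogate with an implicit advantage of +1: once a
    # token's ratio exceeds 1 + epsilon, the clipped term is a constant and
    # its gradient is cut off.
    surrogate = torch.minimum(ratio, torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon))
    return -surrogate.mean()
```

Under this form, standard SFT's negative log-likelihood is replaced by a ratio-based objective whose gradient vanishes for tokens that have already moved more than epsilon away from the reference, which is one way to realize the moderating effect on off-policy tokens that the abstract describes.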
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19335