Abstract: Supervised Fine-Tuning (SFT) is a critical step for adapting Large Language Models (LLMs) to specialized domains, often serving as an initialization for subsequent reinforcement learning (RL). However, SFT can overfit a small set of expert data, harming generalization and eroding prior knowledge. This can limit downstream RL, which benefits from a strong, generalizable initialization for exploration. Here, we demonstrate that prior knowledge degradation primarily results from tokens in the expert data to which the base model assigns low probability. Specifically, these low-probability tokens represent a significant deviation from the model’s current prior knowledge. Due to the nature of the log-likelihood objective, they produce larger gradient magnitudes, which speed up adaptation to the new data but degrade generalization. In this paper, we study the token-wise clipping strategy, a commonly used trust-region method for bounding per-token updates.
We find that it reshapes token-level learning priorities, promoting more progressive adaptation that fits the new data while preserving general abilities.
Compared with standard SFT, clipping low-probability tokens reduces out-of-distribution forgetting by 11.54% and improves final RL performance by 7.09% across the agentic benchmarks. Moreover, latent-space analysis shows smaller representational drift under clipping, indicating that it provides a generalizable initialization.
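Illustrative sketch (not taken from the paper): the abstract's claim that low-probability tokens produce larger gradients can be checked on a toy softmax, where the gradient of the negative log-likelihood with respect to the logits is softmax(logits) minus the one-hot target. The second half shows one common form of token-wise clipping, a PPO-style clipped probability ratio with the advantage fixed to 1. That form, the names `clipped_surrogate` and `gradient_passes`, and the threshold `eps=0.2` are assumptions made for illustration; the abstract does not specify the exact clipping rule used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_grad_wrt_logits(logits, target):
    """Gradient of -log p(target) w.r.t. the logits: softmax - onehot."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def grad_norm(logits, target):
    """L2 norm of the per-token NLL gradient w.r.t. the logits."""
    return math.sqrt(sum(g * g for g in nll_grad_wrt_logits(logits, target)))


def clipped_surrogate(p_new, p_ref, eps=0.2):
    """Assumed PPO-style objective with advantage 1: min(ratio, 1 + eps)."""
    ratio = p_new / p_ref
    return min(ratio, 1.0 + eps)


def gradient_passes(p_new, p_ref, eps=0.2):
    """Under this assumed rule, a token contributes gradient only while
    its probability ratio is still inside the upper clip bound."""
    return p_new / p_ref <= 1.0 + eps


if __name__ == "__main__":
    logits = [2.0, 0.0, -2.0]  # token 0 is likely, token 2 is unlikely
    probs = softmax(logits)
    for target in (0, 2):
        print(f"target={target} p={probs[target]:.4f} "
              f"grad_norm={grad_norm(logits, target):.4f}")
    # A token whose probability has already risen far above the reference
    # is clipped and stops contributing gradient under the assumed rule.
    print(gradient_passes(p_new=0.11, p_ref=0.10))
    print(gradient_passes(p_new=0.50, p_ref=0.10))
```

In this toy setting the unlikely target yields a much larger gradient norm than the likely one, and the clip bounds how far any single token's probability can move relative to the reference before its update is cut off.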
Lay Summary: LLMs are often improved by training them on examples written by experts. This helps the model adapt to new tasks, but it can also make the model change too aggressively and forget useful abilities it learned during its original training. This is especially harmful when the model is later used as the starting point for another training stage that requires exploration and general problem-solving.
Our work studies why this forgetting happens. We find that examples containing answers that the original model considered very unlikely can create unusually large updates during training. These updates help the model fit the expert examples quickly, but they can also damage its broader abilities. To address this, we use a simple training strategy that limits how much the model can change its predictions in each training step. This encourages the model to learn new behaviors more gradually.
Across several reasoning and decision-making tasks, this approach reduces forgetting on unseen tasks and improves final performance after later training. Overall, the method provides a more stable way to adapt language models while preserving their general abilities.
Originally Submitted Supplementary Material: zip
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Supervised Fine-Tuning, Reinforcement Learning
Originally Submitted PDF: pdf
Submission Number: 21123