RL Fine-Tuning Heals the OOD Forgetting in SFT

Published: 23 Sept 2025, Last Modified: 07 Dec 2025 · FoRLM 2025 · CC BY 4.0
Keywords: Reinforcement Learning, Supervised Fine-tuning, OOD Forgetting, Two-stage Fine-tuning
Abstract: The two-stage fine-tuning paradigm of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has empirically shown better performance than one-stage SFT for the post-training of Large Language Models (LLMs). However, the evolution and mechanism behind the synergy of SFT and RL are still under-explored and inconclusive. To investigate this issue, we dissect the Out-Of-Distribution (OOD) vs. In-Distribution (ID) generalization performance of LLaMA-3.2-11B and Qwen-2.5-7B during the fine-tuning process (full-parameter, rather than LoRA), and conduct fine-grained analysis. Beyond the simple forgetting issue of SFT, we have other interesting findings: (1) The subsequent RL stage does not generate fundamentally new capabilities; instead, it plays a memory restoration role, recovering most of the OOD performance lost during SFT; (2) The memory recovery ability has a limit, i.e., if SFT trains for too long, RL cannot recover the lost OOD ability, and the ID test loss cannot indicate this limit; (3) To uncover the underlying mechanisms behind the forgetting and restoration process, we employ SVD analysis of the parameter matrices. Contrary to the common belief that shifts in model capacity mainly result from changes in the singular values, we find that the singular values are actually quite stable throughout fine-tuning. Instead, the OOD behavior strongly correlates with the rotation of singular vectors. In a nutshell, SFT performs hard alignment of the crucial parameter directions to the target tasks, leading to rapid and greedy adjustment, but also quick forgetting; RL then softly and slowly re-aligns the singular vectors towards a more robust configuration, healing the forgetting and learning the downstream tasks simultaneously. We further validate the role of singular vectors by manually editing the model parameters.
Our findings re-characterize the role of RL in two-stage fine-tuning and identify the rotation of singular vectors as the key mechanism.
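The abstract's SVD analysis can be illustrated with a minimal sketch (not the paper's actual code): given a weight matrix before and after fine-tuning, compare the stability of the singular values against the rotation of the singular vectors. The matrix sizes and the additive-update model of fine-tuning below are hypothetical, for illustration only.

```python
import numpy as np

# Hypothetical setup: a base weight matrix and a fine-tuned version,
# where fine-tuning is simulated as a small additive perturbation.
rng = np.random.default_rng(0)
W_base = rng.standard_normal((64, 64))
W_tuned = W_base + 0.05 * rng.standard_normal((64, 64))

U0, S0, Vt0 = np.linalg.svd(W_base)
U1, S1, Vt1 = np.linalg.svd(W_tuned)

# Singular-value stability: relative shift of the spectrum.
sv_shift = np.linalg.norm(S1 - S0) / np.linalg.norm(S0)

# Singular-vector rotation: per-direction alignment |<u_i^base, u_i^tuned>|
# (1.0 = no rotation, 0.0 = orthogonal); abs() absorbs SVD sign ambiguity.
alignment = np.abs(np.sum(U0 * U1, axis=0))
mean_rotation_deg = np.degrees(np.arccos(np.clip(alignment, 0.0, 1.0))).mean()

print(f"relative singular-value shift: {sv_shift:.3f}")
print(f"mean left singular-vector rotation: {mean_rotation_deg:.1f} deg")
```

Under the paper's finding, real checkpoints would show a small `sv_shift` throughout fine-tuning while the rotation angles track the OOD forgetting and recovery; here the quantities are computed on synthetic matrices purely to show the measurement.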
Submission Number: 57