Keywords: post-training, catastrophic forgetting, supervised finetuning, reinforcement learning
TL;DR: RL forgets less than SFT due to its mode-seeking, on-policy nature, motivating the use of approximately on-policy data for SFT to reduce forgetting
Abstract: Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities---a phenomenon classically known as *catastrophic forgetting*. In this paper, we set out to identify specific guidelines to mitigate this phenomenon, by systematically comparing the forgetting patterns of supervised fine-tuning (SFT) and reinforcement learning (RL), two widely adopted post-training methods. Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance.
To investigate the cause of this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the *mode-seeking* nature of RL, which stems from its use of *on-policy* data, enables it to keep prior knowledge intact while learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using *approximately* on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
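For intuition, here is a minimal formal sketch of the mass-covering vs. mode-seeking contrast the abstract alludes to; the notation is ours and the paper's exact formalization may differ. Writing the base LM as a two-component mixture, SFT (maximum likelihood on target-task data) minimizes a forward KL, whereas on-policy, KL-regularized RL behaves like minimizing a reverse KL to the reward-defined target, which can concentrate on the target mode without spreading probability mass away from prior knowledge.

```latex
% Sketch only: our notation, not necessarily the paper's.
% Forward vs. reverse KL behind "mass-covering" (SFT) vs. "mode-seeking" (on-policy RL).
\begin{align*}
  p_{\text{base}} &= \alpha\, p_{\text{prior}} + (1-\alpha)\, p_{\text{task}}
    && \text{(prior knowledge + target task)} \\[4pt]
  \mathcal{L}_{\text{SFT}}(\theta)
    &= \mathrm{KL}\!\left(p_{\text{task}} \,\middle\|\, \pi_\theta\right)
     = \mathbb{E}_{y \sim p_{\text{task}}}\!\left[\log \tfrac{p_{\text{task}}(y)}{\pi_\theta(y)}\right]
    && \text{(forward KL: off-policy data, mass-covering)} \\[4pt]
  \mathcal{L}_{\text{RL}}(\theta)
    &\approx \mathrm{KL}\!\left(\pi_\theta \,\middle\|\, p_{\text{task}}\right)
     = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \tfrac{\pi_\theta(y)}{p_{\text{task}}(y)}\right]
    && \text{(reverse KL: on-policy data, mode-seeking)}
\end{align*}
```

Because the reverse-KL expectation is taken under the model's own samples, it is the on-policy objective: the model is penalized only where it places mass, so it can lock onto the target-task mode without being forced to cover it at the expense of the prior-knowledge component.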
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17270