Keywords: post-training, catastrophic forgetting, supervised finetuning, reinforcement learning
TL;DR: RL forgets less than SFT due to its mode-seeking, on-policy nature, motivating the use of approximately on-policy data for SFT to reduce forgetting
Abstract: Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities---a phenomenon classically known as *catastrophic forgetting*. In this paper, we set out to identify specific guidelines to mitigate this phenomenon, by systematically comparing the forgetting patterns of supervised fine-tuning (SFT) and reinforcement learning (RL), two widely adopted post-training methods. Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance.
To investigate the cause of this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the *mode-seeking* nature of RL, which stems from its use of *on-policy* data, enables it to keep prior knowledge intact while learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using *approximately* on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
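For intuition, here is a minimal formal sketch of the mass-covering vs. mode-seeking contrast the abstract alludes to; the notation is ours and the paper's exact formalization may differ. Writing the base LM as a two-component mixture, SFT (maximum likelihood on target-task data) minimizes a forward KL, whereas on-policy, KL-regularized RL behaves like minimizing a reverse KL to the reward-defined target, which can concentrate on the target mode without spreading probability mass away from prior knowledge.

```latex
% Sketch only: our notation, not necessarily the paper's.
% Forward vs. reverse KL behind "mass-covering" (SFT) vs. "mode-seeking" (on-policy RL).
\begin{align*}
  p_{\text{base}} &= \alpha\, p_{\text{prior}} + (1-\alpha)\, p_{\text{task}}
    && \text{(prior knowledge + target task)} \\[4pt]
  \mathcal{L}_{\text{SFT}}(\theta)
    &= \mathrm{KL}\!\left(p_{\text{task}} \,\middle\|\, \pi_\theta\right)
     = \mathbb{E}_{y \sim p_{\text{task}}}\!\left[\log \tfrac{p_{\text{task}}(y)}{\pi_\theta(y)}\right]
    && \text{(forward KL: off-policy data, mass-covering)} \\[4pt]
  \mathcal{L}_{\text{RL}}(\theta)
    &\approx \mathrm{KL}\!\left(\pi_\theta \,\middle\|\, p_{\text{task}}\right)
     = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \tfrac{\pi_\theta(y)}{p_{\text{task}}(y)}\right]
    && \text{(reverse KL: on-policy data, mode-seeking)}
\end{align*}
```

Because the reverse-KL expectation is taken under the model's own samples, it is the on-policy objective: the model is penalized only where it places mass, so it can lock onto the target-task mode without being forced to cover it at the expense of the prior-knowledge component.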
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17270