Keywords: Reinforcement Learning, Catastrophic Forgetting, Continual Learning
Abstract: We compare fine-tuning models with supervised fine-tuning (SFT) and reinforcement learning (RL) and find that, even at matched new-task accuracy, RL consistently forgets less. We investigate the cause and show that the degree of forgetting is not determined by the training algorithm itself, but by the distributional shift, namely the KL divergence between the fine-tuned and base policy when evaluated on the new task distribution. RL’s advantage arises because on-policy updates bias optimization toward KL-minimal solutions among the many that solve a task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate this across experiments with large language models and controlled toy settings, as well as provide theory on why on-policy RL updates lead to a smaller KL change. We term this principle \textit{RL’s Razor}: among all ways to solve a new task, RL prefers those closest in KL to the original model.
Serve As Reviewer: ~Jyothish_Pari1
Submission Number: 23
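For concreteness, below is a minimal sketch (not the paper's evaluation code) of the quantity the abstract centers on: the per-token KL divergence between the fine-tuned and base policy, estimated on the new-task distribution by sampling completions from the fine-tuned model. It assumes Hugging Face `transformers` causal LMs; the model identifiers and prompt list are placeholders.

```python
# Minimal sketch (assumptions: HF transformers models, placeholder model ids/prompts).
# Monte-Carlo estimate of per-token KL(pi_tuned || pi_base) on new-task prompts,
# with completions sampled from the fine-tuned policy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("base-model")         # placeholder id
tuned = AutoModelForCausalLM.from_pretrained("fine-tuned-model")  # placeholder id
tok = AutoTokenizer.from_pretrained("base-model")                 # placeholder id

@torch.no_grad()
def completion_logprobs(model, full_ids, gen_start):
    """Log-probabilities the model assigns to the generated (completion) tokens only."""
    logits = model(full_ids).logits[:, :-1, :]           # position i predicts token i+1
    logprobs = torch.log_softmax(logits, dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, gen_start - 1:]                   # keep positions predicting completion tokens

@torch.no_grad()
def empirical_kl(prompts, num_samples=4, max_new_tokens=64):
    """Estimate E_{y ~ pi_tuned(.|x)}[log pi_tuned(y|x) - log pi_base(y|x)], averaged per token."""
    total, n_tokens = 0.0, 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        for _ in range(num_samples):
            out = tuned.generate(ids, do_sample=True, max_new_tokens=max_new_tokens)
            lp_tuned = completion_logprobs(tuned, out, gen_start=ids.shape[1])
            lp_base = completion_logprobs(base, out, gen_start=ids.shape[1])
            total += (lp_tuned - lp_base).sum().item()
            n_tokens += lp_tuned.shape[1]
    return total / max(n_tokens, 1)
```

Under the abstract's claim, at matched new-task accuracy an RL-fine-tuned model would yield a smaller value of this estimate than an SFT model trained on the same task.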