Keywords: Reinforcement Learning, Catastrophic Forgetting, Continual Learning
Abstract: We compare fine-tuning models with supervised fine-tuning (SFT) and reinforcement learning (RL) and find that, even at matched new-task accuracy, RL consistently forgets less. We investigate the cause and show that the degree of forgetting is not determined by the training algorithm itself, but by the distributional shift, namely the KL divergence between the fine-tuned and base policy when evaluated on the new task distribution. RL’s advantage arises because on-policy updates bias optimization toward KL-minimal solutions among the many that solve a task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate this across experiments with large language models and controlled toy settings, as well as provide theory on why on-policy RL updates lead to a smaller KL change. We term this principle \textit{RL’s Razor}: among all ways to solve a new task, RL prefers those closest in KL to the original model.
Serve As Reviewer: ~Jyothish_Pari1
Submission Number: 23
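For concreteness, below is a minimal sketch (not the paper's evaluation code) of the quantity the abstract centers on: the per-token KL divergence between the fine-tuned and base policy, estimated on the new-task distribution by sampling completions from the fine-tuned model. It assumes Hugging Face `transformers` causal LMs; the model identifiers and prompt list are placeholders.

```python
# Minimal sketch (assumptions: HF transformers models, placeholder model ids/prompts).
# Monte-Carlo estimate of per-token KL(pi_tuned || pi_base) on new-task prompts,
# with completions sampled from the fine-tuned policy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("base-model")         # placeholder id
tuned = AutoModelForCausalLM.from_pretrained("fine-tuned-model")  # placeholder id
tok = AutoTokenizer.from_pretrained("base-model")                 # placeholder id

@torch.no_grad()
def completion_logprobs(model, full_ids, gen_start):
    """Log-probabilities the model assigns to the generated (completion) tokens only."""
    logits = model(full_ids).logits[:, :-1, :]           # position i predicts token i+1
    logprobs = torch.log_softmax(logits, dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, gen_start - 1:]                   # keep positions predicting completion tokens

@torch.no_grad()
def empirical_kl(prompts, num_samples=4, max_new_tokens=64):
    """Estimate E_{y ~ pi_tuned(.|x)}[log pi_tuned(y|x) - log pi_base(y|x)], averaged per token."""
    total, n_tokens = 0.0, 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        for _ in range(num_samples):
            out = tuned.generate(ids, do_sample=True, max_new_tokens=max_new_tokens)
            lp_tuned = completion_logprobs(tuned, out, gen_start=ids.shape[1])
            lp_base = completion_logprobs(base, out, gen_start=ids.shape[1])
            total += (lp_tuned - lp_base).sum().item()
            n_tokens += lp_tuned.shape[1]
    return total / max(n_tokens, 1)
```

Under the abstract's claim, at matched new-task accuracy an RL-fine-tuned model would yield a smaller value of this estimate than an SFT model trained on the same task.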