Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

Published: 23 May 2026, Last Modified: 23 May 2026, Venue: CATS@ICML26 Poster, License: CC BY 4.0
Keywords: Catastrophic Forgetting, Reinforcement Learning, Supervised Fine-Tuning, Mechanistic Interpretability, Circuit Analysis, Large Language Models, Continual Learning
TL;DR: RL preserves more pretrained circuits than SFT during fine-tuning, at the cost of slower task adaptation. We introduce differential circuit vulnerability to quantify this trade-off and motivate cheaper circuit-aware regularization methods.
Abstract: Fine-tuning large language models (LLMs) on new tasks often causes catastrophic forgetting, degrading prior capabilities. Recent theory suggests reinforcement learning (RL) preserves pretrained abilities better than supervised fine-tuning (SFT) because policy-gradient updates implicitly remain closer to the base policy. We combine RL theory with mechanistic interpretability to study how such distributional constraints affect forgetting at the circuit level. We introduce differential circuit vulnerability, a measure of how specific computational subgraphs degrade under fine-tuning, and use it to compare which circuits are preserved or disrupted by RL and SFT. Using Qwen2.5-3B-Instruct, we evaluate adaptation on a science task dataset and analyze retention of prior capabilities after fine-tuning. Our analysis reveals a mechanistic trade-off: SFT adapts faster but induces greater circuit disruption, whereas RL preserves more distributed functional circuits and prior capabilities, though it may under-optimize new tasks. These findings motivate circuit-aware regularization methods that aim to match RL-level retention at substantially lower computational cost.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 69