Keywords: reinforcement learning, offline-to-online RL, online fine-tuning, stability and plasticity
Abstract: Offline-to-online reinforcement learning (RL) has emerged as a practical paradigm that leverages offline datasets for pretraining and online interactions for fine-tuning. However, its empirical behavior is highly inconsistent: design choices for online fine-tuning that work well in one setting can fail completely in another. Guided by the stability-plasticity principle, we propose a framework that explains this inconsistency. We argue that efficient fine-tuning must preserve the utility of the stronger offline prior, whether that is the pretrained policy or the offline dataset, while maintaining sufficient plasticity. This perspective identifies three regimes of online fine-tuning, each requiring distinct stability properties. We validate the framework through a large-scale empirical study, finding that the results strongly align with its predictions in 45 out of 63 cases, with only 3 opposite mismatches. This work provides a framework for guiding design choices in offline-to-online RL based on the relative performance of the offline dataset and the pretrained policy.
Submission Number: 66