Keywords: offline reinforcement learning, online fine-tuning
TL;DR: We introduce SOAR, a simple dual-annealing strategy that smoothly removes offline data and conservative regularization during fine-tuning, substantially reducing catastrophic failure while improving asymptotic performance in offline-to-online RL.
Abstract: Adopting the pretrain-finetune paradigm, offline-to-online reinforcement learning (RL) first pretrains an agent on historical offline data and then fine-tunes it through online interactions, aiming to leverage prior knowledge while adapting efficiently and safely to the new environment. A central challenge, however, is the tradeoff between catastrophic failure, i.e., a sharp early collapse in performance when the agent first transitions from offline to online, and the asymptotic success rate, i.e., the long-term performance the agent ultimately achieves after sufficient training. In this article, we first conduct a systematic study on various control benchmarks and find that existing offline and offline-to-online RL methods fail to simultaneously prevent catastrophic failure and achieve high asymptotic success rates. Next, we examine how offline data and conservative regularization influence this tradeoff. We then identify spurious Q-optimism as the key driver of collapse: early in fine-tuning, the learned value function can mistakenly rank inferior actions above those learned during offline training, steering the policy toward failure. Finally, we introduce Smooth Offline-to-Online Annealing for RL (SOAR), a simple but effective dual-annealing scheme that gradually reduces reliance on offline data and conservative penalties, thereby mitigating catastrophic failure while improving long-term performance. Extensive numerical experiments confirm the efficacy and robustness of SOAR across diverse RL tasks.
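To make the dual-annealing idea concrete, here is a minimal Python sketch of how one might anneal both the offline-data sampling ratio and the conservative penalty weight during fine-tuning. The linear schedules, the 0.5 starting mixture ratio, the CQL-style `conservative_weight`, and the `agent.update` interface are illustrative assumptions for exposition only, not the paper's exact implementation.

import random

def anneal(step, total_steps, start, end):
    """Linearly interpolate a coefficient from `start` to `end` over `total_steps`."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

def sample_batch(offline_buffer, online_buffer, batch_size, offline_ratio):
    """Mix offline and online transitions according to the annealed offline ratio."""
    n_offline = int(batch_size * offline_ratio)
    batch = random.sample(offline_buffer, min(n_offline, len(offline_buffer)))
    batch += random.sample(online_buffer, min(batch_size - len(batch), len(online_buffer)))
    return batch

def finetune(agent, offline_buffer, online_buffer, total_steps=100_000):
    for step in range(total_steps):
        # 1) Anneal reliance on offline data: from a 50/50 mixture toward online-only.
        offline_ratio = anneal(step, total_steps, start=0.5, end=0.0)
        # 2) Anneal the conservative penalty weight (e.g., a CQL-style alpha) toward zero.
        conservative_weight = anneal(step, total_steps, start=1.0, end=0.0)

        batch = sample_batch(offline_buffer, online_buffer, batch_size=256,
                             offline_ratio=offline_ratio)
        agent.update(batch, conservative_weight=conservative_weight)

Both schedules reach zero by the end of fine-tuning, so the agent transitions smoothly from the conservative, offline-regularized regime to unconstrained online RL rather than switching abruptly at the offline-to-online boundary.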
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10498