Keywords: Offline-to-online RL, Q-learning
Abstract: Offline-to-Online Reinforcement Learning (O2O RL) offers a compelling framework for deploying decision-making agents in domains where online data collection is limited by practical constraints such as cost, risk, or latency. In this paradigm, agents are first trained on fixed datasets and then refined through limited online interaction. This transition, however, exposes a fundamental challenge: the misestimation of state-action values for out-of-distribution (OOD) actions, inherited from the offline phase. Such misestimation can severely destabilize online adaptation and lead to suboptimal policy behavior. To address this, we propose an algorithm-agnostic method that regularizes the Q-function prior to fine-tuning by injecting structured noise into dataset actions. This procedure explicitly bounds Q-value estimates across the entire action space, not just over in-distribution actions, mitigating both overestimation and underestimation. We further introduce a tunable parameter that governs the balance between conservatism and optimism within the Q-value bounds during online fine-tuning. Extensive empirical evaluations on standard O2O RL benchmarks show that our method yields substantial improvements over strong baselines in both stability and final performance. These results underscore the importance of principled Q-function initialization and offer a practical path toward more robust reinforcement learning under distributional shift.
Primary Area: reinforcement learning
Submission Number: 9039
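The noise-injection regularization described in the abstract could look roughly like the sketch below. It is a minimal illustration, assuming a continuous-action setting, a PyTorch Q-network callable as `q_net(states, actions)`, and a data loader yielding `(state, action)` minibatches from the offline dataset; the names `beta` and `noise_std` and the specific bounding targets are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch, not the authors' implementation: before online
# fine-tuning, regress Q on noise-perturbed dataset actions toward a
# bounded target so that OOD action values stay within chosen limits.
import torch


def regularize_q(q_net, optimizer, loader, noise_std=0.3, beta=0.5, epochs=5):
    """Pre-fine-tuning pass that bounds Q-values on noise-perturbed actions.

    beta in [0, 1] trades off a conservative target (batch-minimum Q)
    against an optimistic one (batch-maximum Q) for perturbed actions.
    """
    for _ in range(epochs):
        for states, actions in loader:  # assumed offline (state, action) batches
            with torch.no_grad():
                q_data = q_net(states, actions)               # in-distribution values
                lower, upper = q_data.min(), q_data.max()     # conservative / optimistic anchors
                target = beta * upper + (1.0 - beta) * lower  # tunable Q-value bound

            # Structured noise yields actions near or outside the dataset support.
            noisy_actions = (actions + noise_std * torch.randn_like(actions)).clamp(-1.0, 1.0)

            # Pull Q-values on perturbed actions toward the bounded target.
            loss = ((q_net(states, noisy_actions) - target) ** 2).mean()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return q_net
```

In this sketch, a smaller `beta` yields a more conservative initialization, while a larger `beta` admits more optimistic estimates, mirroring the tunable conservatism/optimism trade-off the abstract describes for the online fine-tuning phase.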