Keywords: Reinforcement Learning, Offline-to-Online Reinforcement Learning, Online Pre-Training, Online Fine-Tuning
TL;DR: Offline-to-online RL often struggles to fine-tune offline pre-trained agents; we propose OPT, a novel pre-training phase that mitigates this issue and demonstrates strong performance.
Abstract: Reinforcement Learning (RL) has achieved notable success in tasks requiring complex decision making, with offline RL offering the ability to train agents using fixed datasets, thereby avoiding the risks and costs associated with online interactions. However, offline RL is inherently limited by the quality of the dataset, which can restrict an agent's performance. Offline-to-online RL aims to bridge the gap between the cost-efficiency of offline RL and the performance potential of online RL by pre-training an agent offline before fine-tuning it through online interactions. Despite its promise, recent studies show that offline pre-trained agents often underperform during online fine-tuning due to an inaccurate value function, with random initialization proving more effective in certain cases. In this work, we propose a novel method, Online Pre-Training for Offline-to-Online RL (OPT), to address the issue of inaccurate value estimation in offline pre-trained agents. OPT introduces a new learning phase, Online Pre-Training, which allows the training of a new value function that enhances the subsequent fine-tuning process. Implementing OPT on TD3 and SPOT yields an average 30% performance improvement across D4RL domains such as MuJoCo, Antmaze, and Adroit.
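For readers wanting a concrete picture of the core idea, here is a minimal sketch of what "training a new value function from online interactions before fine-tuning" could look like in a TD3-style setup. It is an illustrative assumption based only on the abstract, not the authors' implementation: the network architecture, hyperparameters, and helper names (`QNetwork`, `online_pretrain_value`) are all hypothetical.

```python
# Hypothetical sketch of the Online Pre-Training idea: after offline
# pre-training, a freshly initialized Q-function is fit on newly collected
# online transitions, so fine-tuning does not inherit the inaccurate
# value estimates learned offline. Details are assumptions, not OPT's code.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Simple state-action value network, as used in TD3-style agents."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def online_pretrain_value(policy, q_new, q_target, batches, gamma=0.99):
    """Fit a fresh Q-function on online transitions with a standard TD(0)
    target; the offline pre-trained policy is only used to act."""
    opt = torch.optim.Adam(q_new.parameters(), lr=3e-4)
    for s, a, r, s2, done in batches:  # transitions from online interaction
        with torch.no_grad():
            a2 = policy(s2)
            target = r + gamma * (1.0 - done) * q_target(s2, a2)
        loss = nn.functional.mse_loss(q_new(s, a), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Polyak averaging of the target network, as in TD3.
        for p, tp in zip(q_new.parameters(), q_target.parameters()):
            tp.data.mul_(0.995).add_(0.005 * p.data)
```

In this reading, the new Q-function (rather than the offline one) would then drive the subsequent online fine-tuning phase; how OPT combines the two value functions is specified in the paper itself, not here.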
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6091