ORAD: a new framework of offline Reinforcement Learning with Q-value regularization

Published: 01 Jan 2024 · Last Modified: 25 May 2024 · Evolutionary Intelligence, 2024 · License: CC BY-SA 4.0
Abstract: Offline Reinforcement Learning (RL) defines a framework for learning from a previously collected static buffer. However, offline RL is prone to approximation errors caused by out-of-distribution (OOD) data and is particularly inefficient for pixel-based learning tasks compared with state-based control methods. Several pioneering efforts have addressed this problem: some use pessimistic Q-value approximation for unseen observations, while others train a model of the environment on the previously collected data and use it to learn policies. However, these methods require accurate and time-consuming estimation of the Q-values or the environment model. Based on this observation, we present offline RL with augmented data (ORAD), a simple but non-trivial extension to offline RL algorithms. We show that simple data augmentations, e.g. random translation and random crop, significantly improve the performance of state-of-the-art offline RL algorithms. In addition, we find that regularizing the Q-values further enhances performance. Extensive experiments on the pixel-based Atari control benchmark demonstrate the superiority of ORAD over SOTA offline RL methods in both performance and data efficiency, and reveal that ORAD is especially effective for pixel-based control.
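The sketch below illustrates the two ingredients the abstract names: image augmentation of sampled observations and a Q-value regularizer added to a standard TD objective. It is a minimal, hedged example, not the authors' implementation: the helper names (`random_crop`, `orad_style_loss`), the replicate-padding crop, and the L2 penalty on predicted Q-values are all illustrative assumptions, since the paper's exact augmentation parameters and regularizer are not specified here.

```python
# Minimal sketch (not the authors' code): random-crop augmentation for image
# observations plus a simple Q-value regularizer on top of a TD loss.
# Names and hyperparameters (pad, reg_weight) are illustrative assumptions.
import torch
import torch.nn.functional as F


def random_crop(obs, pad=4):
    """Pad each image and take a random crop back to the original size."""
    n, c, h, w = obs.shape
    padded = F.pad(obs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(obs)
    for i in range(n):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out


def orad_style_loss(q_net, target_q_net, batch, gamma=0.99, reg_weight=0.1):
    """TD loss on augmented observations plus an L2 penalty on Q-values."""
    obs, actions, rewards, next_obs, dones = batch
    obs_aug = random_crop(obs)            # augment current observations
    next_obs_aug = random_crop(next_obs)  # augment next observations

    q_values = q_net(obs_aug).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q_net(next_obs_aug).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    td_loss = F.mse_loss(q_values, target)
    # Assumed regularizer: penalize large Q-values to discourage
    # overestimation on (possibly out-of-distribution) augmented inputs.
    q_reg = q_values.pow(2).mean()
    return td_loss + reg_weight * q_reg
```

In this sketch the augmentation is applied to both the current and next observations before the Bellman target is computed, and the regularization strength `reg_weight` would be tuned per task.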