Model-Based Offline Reinforcement Learning with Conservative Bidirectional Rollouts

Zixian Zhou; Xiang Ao; Yang Liu; Qing He

Model-Based Offline Reinforcement Learning with Conservative Bidirectional Rollouts

Zixian Zhou, Xiang Ao, Yang Liu, Qing He

18 Sept 2023 (modified: 25 Mar 2024)ICLR 2024 Conference Withdrawn SubmissionEveryoneRevisionsBibTeX

Keywords: Offline reinforcement learning, model-based policy optimization, conservative bidirectional rollouts

TL;DR: The paper devises a conservative bidirectional model-based algorithm for offline reinforcement learning.

Abstract: Offline reinforcement learning (offline RL) learns from an offline dataset without further interactions with the environment. Although such offline training patterns can avoid cost and damage in the real environment, one main challenge is the distributional shift between the state-action pairs visited by the learned policy and those in the offline dataset. Prevailed existing model-based offline RL approaches learn a dynamics model from the dataset and perform pessimistic policy optimization based on uncertainty estimation. However, the inaccurate quantification of model uncertainty may incur the poor generalization and performance of model-based approaches, especially in the datasets lacking of sample diversity. To tackle this limitation, we instead design a novel framework for model-based offline RL, named Conservative Offline Bidirectional Model-based Policy Optimization (abbr. as COBiMO). First, we learn an ensemble bidirectional model from the offline dataset and construct long bidirectional rollouts by joining two unidirectional ones, thereby increasing the diversity of the model rollouts. Second, we devise a conservative rollout method that minimizes the reconstruction loss, further improving the sample accuracy. We theoretically prove that the bound of rollout error of COBiMO is tighter than the ones using the unidirectional models. Empirical results also show that COBiMO outperforms previous offline RL algorithms on the widely used benchmark D4RL.

Primary Area: reinforcement learning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 1080

Loading