Keywords: offline reinforcement learning, unsupervised learning, data sharing
TL;DR: We propose a principled way to leverage unlabeled datasets in offline RL, with guarantees in linear MDPs and empirical gains over previous methods.
Abstract: Self-supervised methods play a vital role in fueling the progress of deep learning by deriving supervision from the data itself, obviating the need for expensive annotations. The same merit applies to offline reinforcement learning (RL), which conducts RL in a supervised manner, yet it is unclear how to utilize such unlabeled data to improve offline RL in a principled way. In this paper, we examine the theoretical benefit of unlabeled data in the context of linear MDPs and propose a novel algorithm, Provable Data Sharing (PDS), to utilize such unlabeled data for offline RL. PDS imposes additional penalties on the reward function learned from labeled data to avoid potential overestimation of the reward. We show that such a penalty is crucial to keep the algorithm conservative and that PDS achieves a provable benefit from unlabeled data under mild conditions. Extensive experiments on various offline RL tasks show that PDS significantly improves offline RL algorithms with unlabeled data.
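The abstract's core mechanism, relabeling unlabeled transitions with a learned reward minus an uncertainty penalty, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual algorithm: it assumes a linear reward model fit by ridge regression on labeled features, with an elliptical uncertainty penalty of the kind common in linear MDP analyses; the function name, `beta`, and `lam` are illustrative choices.

```python
import numpy as np

def pessimistic_rewards(X_lab, r_lab, X_unlab, beta=1.0, lam=1.0):
    """Hypothetical sketch of pessimistic reward relabeling:
    fit a linear reward model on labeled features, then label
    unlabeled transitions with the prediction minus a penalty."""
    d = X_lab.shape[1]
    # Ridge regression on labeled data: theta = (X^T X + lam*I)^{-1} X^T r
    cov = X_lab.T @ X_lab + lam * np.eye(d)
    theta = np.linalg.solve(cov, X_lab.T @ r_lab)
    # Elliptical uncertainty sqrt(x^T cov^{-1} x) per unlabeled feature
    cov_inv = np.linalg.inv(cov)
    unc = np.sqrt(np.einsum("nd,dk,nk->n", X_unlab, cov_inv, X_unlab))
    # Pessimistic reward: prediction minus beta-scaled penalty
    return X_unlab @ theta - beta * unc
```

Because the penalty is nonnegative, the relabeled rewards never exceed the model's predictions, which is the conservatism the abstract argues is crucial.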
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)