Enhancing Offline Reinforcement Learning with an Optimal Supported Dataset

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Offline reinforcement learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Offline Reinforcement Learning (Offline RL) is challenged by distributional shift and value overestimation, which often lead to poor performance. To address this issue, a popular class of methods uses behavior regularization to constrain the learned policy to stay close to the behavior policy. However, this approach can be overly restrictive when the behavior policy is suboptimal. To overcome this limitation, we propose to conduct behavior regularization directly on an optimal supported dataset, which both keeps the learned policy within the support of the data and reduces the bias that regularizing toward a suboptimal behavior policy would introduce into the optimization objective. We introduce \textit{\textbf{O}ptimal \textbf{S}upported \textbf{D}ataset generation via Stationary \textbf{DI}stribution \textbf{C}orrection \textbf{E}stimation} (OSD-DICE) to generate such a dataset. OSD-DICE is based on the primal-dual formulation of linear programming for RL. It uses a single minimization objective to avoid the poor convergence issues often associated with this formulation, and incorporates two key designs to ensure polynomial sample complexity under general function approximation and single-policy concentrability. After generating the near-optimal supported dataset, we instantiate our framework with two representative behavior-regularization-based methods and show safe policy improvement over the near-optimal supported policy. Empirical results validate the efficacy of OSD-DICE on tabular tasks and demonstrate remarkable performance gains of the proposed framework on D4RL benchmarks.
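A minimal sketch (not the authors' implementation) of the two-stage pipeline the abstract describes: given stationary distribution correction ratios w(s, a) ≈ d*(s, a) / d_D(s, a) produced by a DICE-style estimator, the offline dataset is resampled into a near-optimal supported dataset, on which an off-the-shelf behavior-regularization method is then run unchanged. The function name `resample_supported_dataset` and the dictionary-of-arrays dataset layout are illustrative assumptions; the actual OSD-DICE objective is not given in this abstract and is left out here.

```python
import numpy as np


def resample_supported_dataset(dataset, ratios, rng=None):
    """Resample transitions in proportion to estimated ratios w(s, a).

    `dataset` is assumed to be a dict of equally long arrays
    (e.g., 'states', 'actions', 'rewards', 'next_states'); the returned
    dataset has the same size but approximates the near-optimal supported
    distribution, so every transition still lies in the data support.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(ratios, dtype=np.float64)
    probs = w / w.sum()  # normalize ratios into sampling probabilities
    n = len(probs)
    idx = rng.choice(n, size=n, replace=True, p=probs)
    return {k: np.asarray(v)[idx] for k, v in dataset.items()}


# Downstream, a behavior-regularized actor update (e.g., a TD3+BC-style loss)
# now constrains the policy toward the resampled near-optimal supported data
# rather than toward the raw (possibly suboptimal) behavior policy:
#   actor_loss = -lam * Q(s, pi(s)).mean() + ((pi(s) - a_resampled) ** 2).mean()
```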
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5581