RAVL: Reach-Aware Value Learning for the Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Offline Reinforcement Learning, Model-Based Reinforcement Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Offline reinforcement learning makes use of pre-collected datasets and has emerged as a powerful paradigm for training agents without the need for expensive or unsafe online data collection. This offline approach, however, introduces the additional challenge of evaluating values for state-action pairs not seen in the dataset, termed the out-of-sample problem. Model-based approaches deal with this by allowing the agent to collect additional data through rollouts in a learned dynamics model. The prevailing theoretical understanding is that this effectively resolves the out-of-sample issue, and that any remaining difficulties are due to errors in the learned dynamics model. Based on this understanding, one would expect improvements to the dynamics model to lead to corresponding improvements in the learned policy. Surprisingly, however, we find that existing algorithms fail completely when the true dynamics are provided in place of the learned dynamics model. This observation exposes a common misconception in offline reinforcement learning: dynamics model errors do not, in fact, explain the behavior of model-based methods. Our subsequent investigation reveals a second major and previously overlooked issue in offline model-based reinforcement learning (which we term the edge-of-reach problem), whereby values of states that are only reachable in the final step of the limited-horizon rollouts are pathologically overestimated, analogous to the out-of-sample problem faced by model-free methods. This new insight fills gaps in the existing theory and allows us to reinterpret the efficacy of prior model-based methods. Guided by this understanding, we propose Reach-Aware Value Learning (RAVL), a value-based algorithm that captures value uncertainty at edge-of-reach states. Our method achieves strong performance on the standard D4RL benchmark, and we hope that the insights developed in this paper aid the future design of more accurately motivated offline algorithms.
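For intuition on how value uncertainty at edge-of-reach states might be captured, the sketch below is a minimal illustration, not the authors' implementation; the class and function names (QEnsemble, pessimistic_target) and the specific ensemble-minimum penalty are assumptions made for exposition. It shows one standard way to obtain conservative value estimates: train an ensemble of Q-networks and bootstrap from the ensemble minimum, so that states reached only at the final step of short model rollouts, where bootstrapped targets are never corrected by further data, receive pessimistic values.

    # Illustrative sketch only; names and design choices are assumed, not taken from the paper.
    import torch
    import torch.nn as nn

    class QEnsemble(nn.Module):
        """Ensemble of independent Q-networks; disagreement between members
        serves as an uncertainty signal for rarely reached states."""
        def __init__(self, obs_dim, act_dim, n_members=10, hidden=256):
            super().__init__()
            self.members = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, 1),
                )
                for _ in range(n_members)
            ])

        def forward(self, obs, act):
            x = torch.cat([obs, act], dim=-1)
            # Returns shape (n_members, batch): one Q-estimate per ensemble member.
            return torch.stack([q(x).squeeze(-1) for q in self.members], dim=0)

    def pessimistic_target(q_target, next_obs, next_act, reward, done, gamma=0.99):
        """Bellman target using the ensemble minimum: where members disagree
        (e.g. at edge-of-reach states), the target is pushed down."""
        with torch.no_grad():
            q_next = q_target(next_obs, next_act).min(dim=0).values
            return reward + gamma * (1.0 - done) * q_next

Taking the minimum over an ensemble is a common way to penalize state-action pairs with high value disagreement; under the abstract's framing, edge-of-reach states are exactly those the ensemble has little data to agree on, so such a penalty discourages the policy from exploiting their overestimated values.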
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2365