The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning

Published: 28 Oct 2023, Last Modified: 13 Dec 2023 · ALOE 2023 Poster
Keywords: Offline Reinforcement Learning, Model-Based Reinforcement Learning
Abstract: Training generalist agents requires learning in complex, open-ended environments. In the real world, as in standard benchmarks, such environments often come with large quantities of pre-collected behavioral data. Offline reinforcement learning offers an exciting possibility for leveraging this existing data to kickstart subsequent, expensive open-ended learning. Using offline data with RL, however, introduces the additional challenge of evaluating values for state-actions not seen in the dataset -- termed the out-of-sample problem. One solution is to allow the agent to generate additional synthetic data through rollouts in a learned dynamics model. The prevailing theoretical understanding is that this effectively resolves the out-of-sample issue, and that any remaining difficulties are due to errors in the learned dynamics model. Under this view, one would expect improvements to the dynamics model to translate into improvements in the learned policy. Surprisingly, however, we find that existing algorithms completely fail when the true dynamics are provided in place of the learned dynamics model. This observation exposes a common misconception in offline reinforcement learning: contrary to the prevailing view, dynamics model errors do not explain the behavior of model-based methods. Our subsequent investigation reveals a second, previously overlooked issue in offline model-based reinforcement learning, which we term the edge-of-reach problem: states reachable only at the final step of a truncated rollout are never themselves used as rollout starting points, so their value estimates are never corrected by Bellman updates. Guided by this insight, we propose Reach-Aware Value Learning (RAVL), a value-based algorithm that captures value uncertainty at edge-of-reach states and thereby resolves the edge-of-reach problem. Our method achieves strong performance on the standard D4RL benchmark, and we hope the insights developed in this paper help advance offline RL toward serving as an easily applicable pre-training technique for open-ended settings.
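The core mechanism described above, penalizing value uncertainty at edge-of-reach states, can be illustrated with a minimal numerical sketch. This is not the paper's implementation: the ensemble Q-values below are synthetic stand-ins, and the only assumption encoded is that ensemble members agree on well-covered state-actions but disagree on edge-of-reach ones (since the latter never receive Bellman-update corrections). Taking the minimum over the ensemble then penalizes edge-of-reach values automatically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble of Q-value estimates for two state-actions.
# In-reach (well covered by rollout data): members agree (low noise).
# Edge-of-reach (never corrected by Bellman updates): members disagree.
q_in_reach = 5.0 + 0.05 * rng.standard_normal(10)
q_edge_of_reach = 5.0 + 2.0 * rng.standard_normal(10)

def pessimistic_value(q_ensemble):
    """Minimum over the ensemble: the larger the disagreement, the
    further the min falls below the mean, so uncertain (edge-of-reach)
    values receive a larger implicit penalty."""
    return float(np.min(q_ensemble))

penalty_in = float(np.mean(q_in_reach)) - pessimistic_value(q_in_reach)
penalty_edge = float(np.mean(q_edge_of_reach)) - pessimistic_value(q_edge_of_reach)

print(f"in-reach penalty:      {penalty_in:.3f}")
print(f"edge-of-reach penalty: {penalty_edge:.3f}")
```

The point of the sketch is only the ordering: the edge-of-reach state-action, where the ensemble disagrees, is penalized far more heavily than the in-reach one, which is what allows value learning to avoid exploiting uncorrected values at the edge of the rollout horizon.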
Submission Number: 22