Revisiting Design Choices in Offline Model-Based Reinforcement Learning

NeurIPS 2021 Submission
Keywords: Model-Based Reinforcement Learning, Offline Reinforcement Learning, Uncertainty Quantification
Abstract: Offline reinforcement learning enables agents to make use of large pre-collected datasets of environment transitions and learn control policies without the need for potentially expensive or unsafe online data collection. Recently, significant progress has been made in offline RL, with the dominant approaches being methods that leverage a learned dynamics model. This typically involves constructing a probabilistic model and using it to penalize rewards in regions of high uncertainty, solving a pessimistic MDP that lower bounds the true MDP. Recent work, however, exhibits a breakdown between theory and practice: in theory, the pessimistic return ought to be bounded using the total variation distance between the learned model and the true dynamics, but in practice it is implemented through a penalty based on estimated model uncertainty. This has spawned a variety of uncertainty heuristics, with little to no comparison between the differing approaches. In this paper, we show these heuristics have significant interactions with other design choices, such as the number of models in the ensemble, the model rollout length, and the penalty weight. Furthermore, we compare these uncertainty heuristics under a new evaluation protocol that, for the first time, captures the specific covariate shift induced by model-based RL. This allows us to accurately assess the calibration of the different proposed penalties. Finally, with these insights, we show that selecting these key hyperparameters using Bayesian optimization produces drastically stronger performance than existing hand-tuned methods.
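To make the abstract's penalty mechanism concrete, below is a minimal sketch of an ensemble-disagreement reward penalty in the style described (reward minus a weighted uncertainty estimate). The specific heuristic shown, maximum pairwise distance between ensemble next-state predictions, and the function and argument names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def pessimistic_reward(ensemble_rewards, ensemble_next_states, penalty_weight):
    """Penalize model rewards with an ensemble-disagreement uncertainty heuristic.

    ensemble_rewards:     (n_models, batch) predicted rewards from each ensemble member
    ensemble_next_states: (n_models, batch, state_dim) predicted next-state means
    penalty_weight:       scalar weight trading off return against model uncertainty
    """
    # Uncertainty heuristic (an assumption here): maximum pairwise L2 distance
    # between the next-state predictions of any two ensemble members.
    n_models = ensemble_next_states.shape[0]
    pairwise = [
        np.linalg.norm(ensemble_next_states[i] - ensemble_next_states[j], axis=-1)
        for i in range(n_models) for j in range(i + 1, n_models)
    ]
    uncertainty = np.max(np.stack(pairwise, axis=0), axis=0)  # (batch,)

    mean_reward = ensemble_rewards.mean(axis=0)               # (batch,)
    return mean_reward - penalty_weight * uncertainty


# Toy usage: 5-member ensemble, batch of 4 transitions, 3-dimensional state.
rng = np.random.default_rng(0)
rewards = rng.normal(size=(5, 4))
next_states = rng.normal(size=(5, 4, 3))
print(pessimistic_reward(rewards, next_states, penalty_weight=1.0))
```

The penalty weight, ensemble size, and rollout length used with such a penalty are exactly the hyperparameters the abstract identifies as strongly interacting and worth tuning (e.g., with Bayesian optimization).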
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
Supplementary Material: zip