Keywords: Uncertainty Quantification, Reinforcement Learning, Offline Reinforcement Learning
TL;DR: We highlight that current uncertainty-aware dynamics models cannot produce smooth transition function samples, and we propose a sampling method that can.
Abstract: One of the great challenges with decision-making tasks on real-world systems is that data is sparse and acquiring additional data is expensive. In these cases, it is often crucial to build a model of the environment to assist in making decisions. At the same time, limited data means that learned models are erroneous, making it just as important to equip the model with good predictive uncertainties. In the context of learning sequential decision-making policies, these uncertainties can inform which data to collect for the greatest improvement in policy performance \citep{mehta2021experimental, mehta2022exploration} or warn the policy away from uncertain regions of the state and action space at test time \citep{yu2020mopo}. Additionally, assuming that realistic samples of the environment can be drawn, an adaptable policy can be trained that attempts to make optimal decisions for any possible instance of the environment \citep{ghosh2022offline, chen2021offline}. In this work, we examine the so-called ``probabilistic neural network'' (PNN) model that is ubiquitous in model-based reinforcement learning (MBRL). We argue that while PNN models may have good marginal uncertainties, they form a distribution over non-smooth transition functions. Not only are these samples unrealistic and potentially harmful to adaptability, but we also assert that they lead to poor uncertainty estimates for multi-step trajectory predictions. To address this issue, we propose a simple sampling method that can be implemented on top of pre-existing models. We evaluate our sampling technique on a number of environments, including a realistic nuclear fusion task, and find that smooth transition function samples not only produce better-calibrated uncertainties but also lead to better downstream performance for an adaptive policy.
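To make the abstract's contrast concrete, below is a minimal NumPy sketch of the two rollout behaviors. The toy pnn_predict model and the fixed-noise variant are illustrative assumptions only, not the paper's actual dynamics model or proposed sampling method: the standard PNN rollout draws fresh Gaussian noise at each step (so the implied transition function is non-smooth), while the variant reuses one noise vector for the whole rollout, giving a single consistent sample of the transition function.

    import numpy as np

    def pnn_predict(state, action):
        # Hypothetical stand-in for a trained PNN head that outputs the
        # mean and standard deviation of the next-state distribution.
        mean = state + 0.1 * action          # placeholder dynamics
        std = 0.05 * np.ones_like(state)     # placeholder predictive uncertainty
        return mean, std

    def rollout_independent(state, actions, rng):
        # Standard PNN rollout: a fresh Gaussian draw at every step,
        # so successive transitions are sampled from different functions.
        traj = [state]
        for a in actions:
            mean, std = pnn_predict(traj[-1], a)
            traj.append(mean + std * rng.standard_normal(mean.shape))
        return np.stack(traj)

    def rollout_fixed_noise(state, actions, rng):
        # Illustrative smooth-sampling variant (an assumption, not
        # necessarily the paper's method): reuse one standard-normal
        # vector across the rollout, so the whole trajectory comes from
        # one consistent sample of the transition function.
        eps = rng.standard_normal(state.shape)
        traj = [state]
        for a in actions:
            mean, std = pnn_predict(traj[-1], a)
            traj.append(mean + std * eps)
        return np.stack(traj)

    rng = np.random.default_rng(0)
    s0 = np.zeros(3)
    acts = [0.1 * np.ones(3)] * 5
    print(rollout_independent(s0, acts, rng))
    print(rollout_fixed_noise(s0, acts, rng))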
Submission Number: 9