Keywords: offline reinforcement learning, Gaussian processes, Bayesian inference, uncertainty quantification
TL;DR: Carrying out inference in value-function space with a pessimistic prior leads to high values only where the policy is supported by the data, thus enabling offline RL.
Abstract: We mitigate the effect of distribution shift in offline reinforcement learning through regularisation by inference: carrying out value-function inference with a pessimistic prior induces critic conservatism and avoids unsupported policies.
By introducing a pessimistic prior on the value of the learned policy and carrying out inference in value-function space, we obtain a posterior that assigns high action-values only in regions supported by the dataset.
Regularisation through inference can be less aggressively conservative than other forms of regularisation, such as those that aim for robustness to worst-case outcomes given the data, while still avoiding out-of-distribution actions.
We develop this approach for continuous control and propose a way to make it scalable and compatible with deep learning architectures.
As a by-product of this inference scheme, we also obtain consistent Bayesian uncertainty estimates for model-free off-policy evaluation from a non-episodic dataset of individual transitions.
We present results with exact inference on a toy environment, as well as preliminary results from a scalable, deep version of our framework on a D4RL benchmark robotics task.
Our method shows potential for improved performance on this task, and suggests that future work on improving training stability could yield effective offline reinforcement learning algorithms built from simple modifications of online algorithms.
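To illustrate the mechanism described in the abstract, the following is a minimal, self-contained sketch of Gaussian-process regression on action-values with a pessimistic prior mean. This is not the paper's algorithm; the kernel, dataset values and constants are illustrative assumptions. It shows the behaviour the abstract relies on: the posterior mean tracks the observed Q-value targets where actions are covered by the data and reverts to the pessimistic prior away from it, discouraging out-of-distribution actions.

```python
# Minimal sketch (illustrative assumptions only): GP regression on Q-values
# with a pessimistic prior mean over a 1-D action space.
import numpy as np

def rbf_kernel(a, b, lengthscale=0.3, variance=1.0):
    """Squared-exponential kernel between two sets of 1-D actions."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Hypothetical offline dataset: actions present in the data and their
# observed Q-value targets (e.g. bootstrapped returns).
actions_data = np.array([-0.8, -0.5, -0.2, 0.1])
q_targets = np.array([0.4, 0.6, 0.5, 0.7])

pessimistic_mean = -1.0   # prior belief: unseen actions have low value
noise_var = 1e-2          # observation noise on the Q-value targets

# Standard GP posterior conditioned on the supported actions.
K = rbf_kernel(actions_data, actions_data) + noise_var * np.eye(len(actions_data))
K_inv = np.linalg.inv(K)

def posterior_q(a_query):
    """Posterior mean and variance of Q at query actions."""
    k_star = rbf_kernel(a_query, actions_data)
    mean = pessimistic_mean + k_star @ K_inv @ (q_targets - pessimistic_mean)
    var = rbf_kernel(a_query, a_query).diagonal() - np.einsum(
        "ij,jk,ik->i", k_star, K_inv, k_star)
    return mean, var

# In-distribution action: posterior mean follows the data (high value).
print(posterior_q(np.array([0.0])))
# Out-of-distribution action: posterior mean falls back to ~pessimistic_mean.
print(posterior_q(np.array([2.0])))
```

In this toy setting, a policy that maximises the posterior Q-value is pulled towards actions covered by the dataset simply because the prior dominates elsewhere, rather than through an explicit behaviour-cloning penalty.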
Submission Number: 88