Keywords: offline RL, Bayesian inference, Gaussian processes
TL;DR: We develop a Bayesian formulation of offline RL by encoding the necessary conservatism in the value function prior.
Abstract: Offline reinforcement learning (RL) seeks to improve a policy using only a fixed dataset of past interactions, without further environment exploration. To avoid overly optimistic decisions at uncertain or out-of-distribution actions, the learned policy should be supported by the data. Existing methods typically enforce this via heuristic modifications to objectives or value functions that encourage more conservative action selection. In contrast, we propose a principled alternative: we introduce a conservative value prior, which encodes the belief that policies are expected to perform poorly unless the behavioural data provides evidence to the contrary. This yields a posterior that assigns high value only to supported actions, guiding the agent toward policies grounded in the data. Our approach thereby unifies Bayesian decision-making, uncertainty quantification and value regularisation while effectively mitigating distributional shift in offline RL.
We develop this framework in a model-free setting for continuous control in deterministic environments. We first present an exact inference algorithm for small-scale problems, then extend it to a scalable deep learning variant compatible with standard off-policy algorithms. Our method achieves strong performance on benchmark locomotion tasks, outperforming comparable model-free baselines owing to the milder yet still effective form of regularisation it employs.
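To make the central mechanism of the abstract concrete, the following minimal sketch (an illustrative assumption, not the paper's actual algorithm) performs Gaussian-process regression over values with a pessimistic prior mean. The kernel choice, dataset and constants are hypothetical; the point is only that the posterior reverts to the low prior away from the behavioural data, so value estimates for unsupported actions stay conservative.

```python
# Minimal sketch of a conservative value prior via a 1-D Gaussian process.
# All data, kernel hyperparameters and constants below are illustrative.
import numpy as np


def rbf_kernel(a, b, lengthscale=0.5, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)


# Hypothetical (action, value) pairs observed in the behavioural dataset.
x_data = np.array([-1.0, -0.5, 0.0, 0.4])
y_data = np.array([0.8, 1.0, 0.9, 0.7])

# Conservative prior: values are assumed low unless the data says otherwise.
prior_mean = -2.0
noise_var = 1e-2

# Standard GP posterior mean over a grid of candidate actions.
x_query = np.linspace(-2.0, 2.0, 401)
K = rbf_kernel(x_data, x_data) + noise_var * np.eye(len(x_data))
K_s = rbf_kernel(x_query, x_data)
post_mean = prior_mean + K_s @ np.linalg.solve(K, y_data - prior_mean)

# Near the data the posterior tracks the observed values; far from it,
# the posterior reverts to the pessimistic prior, so a greedy policy over
# post_mean never prefers unsupported actions.
print("value near data   (a=-0.5):", post_mean[np.argmin(np.abs(x_query + 0.5))])
print("value off-support (a= 2.0):", post_mean[-1])
print("greedy action:", x_query[np.argmax(post_mean)])
```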
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Filippo_Valdettaro1
Track: Regular Track: unpublished work
Submission Number: 90