VOCE: Variational Optimization with Conservative Estimation for Offline Safe Reinforcement Learning

Published: 21 Sept 2023, Last Modified: 25 Dec 2023 · NeurIPS 2023 poster
Keywords: Offline safe reinforcement learning, Pessimistic conservative estimation, Variational optimization, Reinforcement Learning
Abstract: Offline safe reinforcement learning (RL) algorithms promise to learn policies that satisfy safety constraints directly from offline datasets, without interacting with the environment. This setting is particularly important in scenarios with high sampling costs and potential dangers, such as autonomous driving and robotics. However, the influence of safety constraints and out-of-distribution (OOD) actions has made it challenging for previous methods to achieve high reward returns while ensuring safety. In this work, we propose a Variational Optimization with Conservative Estimation algorithm (VOCE) to solve the problem of optimizing safe policies from offline datasets. Concretely, we reframe offline safe RL as probabilistic inference, introducing variational distributions that make policy optimization more flexible. Subsequently, we utilize pessimistic estimation methods to estimate the Q-values of cost and reward, which mitigates the extrapolation errors induced by OOD actions. Finally, extensive experiments demonstrate that VOCE achieves competitive performance across multiple experimental tasks and, in particular, outperforms state-of-the-art algorithms in terms of safety.
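To make the pessimistic-estimation idea in the abstract concrete, here is a minimal, hypothetical PyTorch sketch of conservative Q-value training for a reward critic and a cost critic. It illustrates the general CQL-style penalty on policy (potentially OOD) actions only; it is not the authors' VOCE implementation, and all names (`QNet`, `conservative_critic_loss`, `alpha`, `push_down`) are illustrative assumptions.

```python
# Hypothetical sketch: conservative Q estimation for reward and cost critics.
# Not the VOCE implementation; names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class QNet(nn.Module):
    """Simple state-action value network Q(s, a)."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def conservative_critic_loss(q_net, obs, act, target_q, policy_act,
                             alpha=1.0, push_down=True):
    """TD regression loss plus a conservatism penalty on policy actions.

    push_down=True  -> push Q down on policy (possibly OOD) actions relative
                       to dataset actions: pessimistic *reward* estimate.
    push_down=False -> push Q up on policy actions: pessimistic (inflated)
                       *cost* estimate, so the safety constraint is
                       evaluated conservatively.
    """
    td_loss = ((q_net(obs, act) - target_q) ** 2).mean()
    gap = q_net(obs, policy_act).mean() - q_net(obs, act).mean()
    penalty = gap if push_down else -gap
    return td_loss + alpha * penalty
```

In such a setup the reward critic would typically be trained with `push_down=True` and the cost critic with `push_down=False`, so that OOD actions look both less rewarding and more costly than the data supports, discouraging the policy from exploiting extrapolation errors.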
Supplementary Material: zip
Submission Number: 10715