POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning
Abstract: Multi-constraint offline reinforcement learning (RL) promises to learn policies that satisfy both cumulative and state-wise costs from offline datasets. This setting provides an effective approach for the widespread application of RL in high-risk scenarios where both cumulative and state-wise costs need to be considered simultaneously. However, previous constrained offline RL algorithms are primarily designed to handle single-constraint problems related to cumulative cost, and they face challenges when addressing multi-constraint tasks that involve both cumulative and state-wise costs. In this work, we propose a novel Primal policy Optimization with Conservative Estimation algorithm (POCE) to address the problem of multi-constraint offline RL. Concretely, we reframe the objective of multi-constraint offline RL by introducing the concept of Maximum Markov Decision Processes (MMDP). Subsequently, we present a primal policy optimization algorithm to tackle the multi-constraint problem, which improves the stability and convergence speed of model training. Furthermore, we propose a conditional Bellman operator to estimate cumulative and state-wise Q-values, reducing the extrapolation error caused by out-of-distribution (OOD) actions. Finally, extensive experiments demonstrate that the POCE algorithm achieves competitive performance across multiple experimental tasks, particularly outperforming baseline algorithms in terms of safety. Our code is available on GitHub (POCE).
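To make the distinction between the two cost signals concrete, the following is a minimal, hedged sketch (not the authors' implementation) of the two backup targets the abstract alludes to: a standard discounted-sum Bellman target for the cumulative-cost Q-value, and a max-based target for the state-wise cost Q-value under an MMDP-style formulation. The tabular setting, the function names (`cumulative_cost_backup`, `statewise_cost_backup`), and the pessimistic OOD penalty are illustrative assumptions.

```python
# Illustrative sketch (assumed, not the paper's method): backup targets for a
# cumulative-cost Q-value (discounted sum of costs) and a state-wise cost
# Q-value treated as a Maximum MDP (running maximum of per-step costs).

gamma = 0.99          # discount factor (assumed)
ood_penalty = 1.0     # conservative cost bonus for out-of-distribution actions (assumed)

def cumulative_cost_backup(cost, q_cum_next, action_in_dataset):
    """Standard Bellman target for the discounted sum of per-step costs.
    A pessimistic penalty is added when the bootstrapped action is OOD."""
    penalty = 0.0 if action_in_dataset else ood_penalty
    return cost + penalty + gamma * q_cum_next

def statewise_cost_backup(cost, q_max_next, action_in_dataset):
    """MMDP-style target: the Q-value tracks the maximum per-step cost
    along the trajectory rather than its discounted sum."""
    penalty = 0.0 if action_in_dataset else ood_penalty
    return max(cost + penalty, gamma * q_max_next)

# Toy usage: one transition with per-step cost 0.2 and bootstrapped next-state values.
print(cumulative_cost_backup(cost=0.2, q_cum_next=1.5, action_in_dataset=True))   # 0.2 + 0.99 * 1.5
print(statewise_cost_backup(cost=0.2, q_max_next=0.8, action_in_dataset=False))   # max(1.2, 0.99 * 0.8)
```

Under this sketch, a primal policy update would then constrain both estimates jointly, e.g., keeping the cumulative-cost Q-value below a trajectory-level budget and the state-wise Q-value below a per-state limit.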