CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning

Long Yang; Yu Zhang; Jiaming Ji; Juntao Dai; Weidong Zhang

CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning

Long Yang, Yu Zhang, Jiaming Ji, Juntao Dai, Weidong Zhang

29 Sept 2021 (modified: 22 Jun 2025)ICLR 2022 Conference Withdrawn SubmissionReaders: Everyone

Keywords: reinforcement learning, constrained Markov decision processes, safety learning

Abstract: Safe reinforcement learning (RL) is still very challenging since it requires the agent to consider both return maximization and safe exploration. In this paper, we propose CUP, a \textbf{C}onservative \textbf{U}pdate \textbf{P}olicy algorithm with a theoretical safety guarantee. The derivation of CUP is based on surrogate functions w.r.t. our new proposed bounds. Although using bounds as surrogate functions to design safe RL algorithms have appeared in some existing works, we develop them at least three aspects: \textbf{(i)} We provide a rigorous theoretical analysis to extend the bounds w.r.t. generalized advantage estimator (GAE). GAE significantly reduces variance while maintains a tolerable level of bias, which is an efficient step for us to design CUP; \textbf{(ii)} The proposed bounds are more compact than existing works, i.e., using the proposed bounds as surrogate functions are better local approximations to the objective and constraints. \textbf{(iii)} The bound of worst-case safe constraint violation of CUP is more compact than the existing safe RL algorithms, which explains why CUP is so good in practice. Finally, extensive experiments on continuous control tasks show the effectiveness of CUP where the agent satisfies safe constraints.

One-sentence Summary: We propose a new approach to solve safe reinforcement learning.

Supplementary Material: zip

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 3 code implementations](https://www.catalyzex.com/paper/cup-a-conservative-update-policy-algorithm/code)

19 Replies

Loading