CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning


Sep 29, 2021 (edited Oct 05, 2021) · ICLR 2022 Conference Blind Submission
  • Keywords: reinforcement learning, constrained Markov decision processes, safety learning
  • Abstract: Safe reinforcement learning (RL) remains very challenging because it requires the agent to balance return maximization with safe exploration. In this paper, we propose CUP, a \textbf{C}onservative \textbf{U}pdate \textbf{P}olicy algorithm with a theoretical safety guarantee. The derivation of CUP is based on surrogate functions w.r.t. our newly proposed bounds. Although using bounds as surrogate functions to design safe RL algorithms has appeared in existing work, we develop this idea in at least three aspects: \textbf{(i)} we provide a rigorous theoretical analysis that extends the bounds to the generalized advantage estimator (GAE); GAE significantly reduces variance while maintaining a tolerable level of bias, which is a key step in designing CUP; \textbf{(ii)} the proposed bounds are tighter than those in existing work, i.e., using them as surrogate functions yields better local approximations to the objective and constraints; \textbf{(iii)} CUP's bound on the worst-case safety-constraint violation is tighter than those of existing safe RL algorithms, which helps explain its strong empirical performance. Finally, extensive experiments on continuous control tasks show the effectiveness of CUP: the agent satisfies the safety constraints.
  • One-sentence Summary: We propose a new safe reinforcement learning algorithm with a theoretical safety guarantee.
  • Supplementary Material: zip
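Since the abstract highlights GAE's variance-bias trade-off as a building block of CUP, a minimal sketch of the generalized advantage estimator may make that point concrete: the parameter λ interpolates between pure one-step TD errors (λ=0, low variance, more bias) and Monte-Carlo-style returns (λ=1, high variance, no bias). The function name and NumPy implementation below are illustrative, not the paper's code.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards: length-T list of rewards r_t.
    values:  length-(T+1) list of value estimates V(s_t),
             including the bootstrap value V(s_T).
    Computes A_t = sum_l (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

For example, with all value estimates zero, setting `lam=0.0` returns the raw one-step TD errors, while `lam=1.0` returns discounted reward-to-go sums.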