- Abstract: In many reinforcement learning applications, it is crucial that the agent interacts with the environment only through safe policies, i.e.,~policies that do not take the agent to certain undesirable situations. These problems are often formulated as a constrained Markov decision process (CMDP) in which the agent's goal is to optimize its main objective while not violating a number of safety constraints. In this paper, we propose safe policy optimization algorithms that are based on the Lyapunov approach to CMDPs, an approach that has well-established theoretical guarantees in control engineering. We first show how to generate a set of state-dependent Lyapunov constraints from the original CMDP safety constraints. We then propose safe policy gradient algorithms that train a neural network policy using DDPG or PPO, while guaranteeing near-constraint satisfaction at every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the linearized Lyapunov constraints. Unlike the existing (safe) constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Furthermore, the action-projection version of our algorithms often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with CPO and the Lagrangian method on several high-dimensional continuous state and action simulated robot locomotion tasks, in which the agent must satisfy certain safety constraints while minimizing its expected cumulative cost.
- Keywords: Reinforcement Learning, Safe Learning, Lyapunov Functions, Constrained Markov Decision Problems
- TL;DR: Safe Reinforcement Learning Algorithms for Continuous Control