- TL;DR: We propose a new algorithm that learns constraint-satisfying policies, and provide theoretical analysis and empirical demonstration in the context of reinforcement learning with constraints.
- Abstract: In this paper, we consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection Based Constrained Policy Optimization (PCPO), an iterative method that optimizes policies in a two-step process: the first step performs an unconstrained update, while the second step reconciles the constraint violation by projecting the policy back onto the constraint set. We theoretically analyze PCPO and provide a lower bound on reward improvement, as well as an upper bound on constraint violation, for each policy update. We further characterize the convergence of PCPO with projection based on two different metrics: L2 norm and Kullback-Leibler divergence. Our empirical results over several control tasks demonstrate that our algorithm achieves superior performance, averaging more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
- Code: https://sites.google.com/view/iclr2020-submission-pcpo
- Keywords: Reinforcement learning with constraints, Safe reinforcement learning
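
The two-step process described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a plain gradient step in place of the unconstrained policy update, a single constraint linearized around the current parameters as `a . (theta - theta_old) + b <= 0`, and the L2 (Euclidean) projection only. All function and parameter names here are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(u, v))


def unconstrained_step(theta, reward_grad, step_size):
    """Step 1 (simplified assumption): gradient ascent on the reward,
    ignoring the constraint entirely."""
    return [t + step_size * g for t, g in zip(theta, reward_grad)]


def project_l2_halfspace(theta_half, theta_old, a, b):
    """Step 2: Euclidean projection of theta_half onto the half-space
    {theta : a . (theta - theta_old) + b <= 0}. Assumes a is nonzero."""
    diff = [h - o for h, o in zip(theta_half, theta_old)]
    violation = dot(a, diff) + b
    if violation <= 0.0:
        # Already inside the (linearized) constraint set: nothing to do.
        return list(theta_half)
    scale = violation / dot(a, a)
    # Move back along the constraint normal just far enough to reach the boundary.
    return [h - scale * ai for h, ai in zip(theta_half, a)]


def two_step_update(theta, reward_grad, a, b, step_size=1.0):
    """One iteration: unconstrained update, then projection onto the constraint set."""
    theta_half = unconstrained_step(theta, reward_grad, step_size)
    return project_l2_halfspace(theta_half, theta, a, b)
```

For example, `two_step_update([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], -0.5)` first moves to `[1.0, 0.0]`, which violates the linearized constraint by 0.5, and the projection then returns `[0.5, 0.0]`, the closest point on the constraint boundary. A projection under Kullback-Leibler divergence would replace the Euclidean geometry in the second step and is not covered by this sketch.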