- Keywords: Constrained Reinforcement Learning, Pareto optimization, Constrained Markov Decision Process
- Abstract: Constrained Reinforcement Learning (CRL) has attracted broad interest in recent years, as it pursues the dual goals of maximizing long-term returns and constraining costs. Although CRL can be cast as a multi-objective optimization problem, it remains largely unsolved by standard Pareto optimization approaches. The key challenge is that gradient-based Pareto optimization agents tend to stick to known Pareto-optimal solutions even when those solutions yield poor returns (e.g., the safest self-driving car that never moves) or violate the constraints (e.g., the record-breaking racer that crashes the car). In this paper, we propose a novel Pareto optimization method for CRL with two gradient recalibration techniques to overcome this challenge. First, to explore around feasible Pareto-optimal solutions, we use gradient re-balancing to let the agent improve more on under-optimized objectives at each policy update. Second, to escape from infeasible solutions, we propose gradient perturbation to temporarily sacrifice return in order to reduce costs. Experiments on the SafetyGym benchmarks show that our method consistently outperforms previous CRL methods in return while satisfying the cost constraints.
- One-sentence Summary: We propose a novel Constrained Reinforcement Learning paradigm from the perspective of searching for Pareto-optimal policies.