Feasible Policy Optimization for Safe Reinforcement Learning

ICLR 2026 Conference Submission 15571 Authors

Published: 19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: Safe reinforcement learning, feasible policy iteration, region-wise policy optimization, constraint decay function
TL;DR: Proposes a region-wise policy optimization method that monotonically improves safety and performance.
Abstract: Policy gradient methods serve as a cornerstone of reinforcement learning (RL), yet their extension to safe RL, where policies must strictly satisfy safety constraints, remains challenging. While existing methods enforce constraints in every policy update, we demonstrate that this is unnecessarily conservative. Instead, each update only needs to progressively expand the feasible region while improving the value function. Our proposed algorithm, feasible policy optimization (FPO), simultaneously achieves both objectives by solving a region-wise policy optimization problem. Specifically, FPO maximizes the value function inside the feasible region and minimizes the feasibility function outside it. We prove that these two sub-problems share a common optimal solution, which is obtained from a tight bound we derive on the constraint decay function. Extensive experiments on the Safety-Gymnasium benchmark show that FPO achieves excellent constraint satisfaction while maintaining competitive task performance, striking a favorable balance between safety and return compared to state-of-the-art safe RL algorithms.
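
A minimal sketch of how the region-wise update described in the abstract might be formalized, assuming a feasibility function $F^{\pi}$ whose feasible region is $X_f^{\pi} = \{ s : F^{\pi}(s) \le 0 \}$ and a state distribution $d^{\pi_k}$ under the current policy; the notation is illustrative and not necessarily the paper's:

\[
\pi_{k+1} \in \arg\max_{\pi} \; \mathbb{E}_{s \sim d^{\pi_k}}\!\Big[ \mathbf{1}\{ s \in X_f^{\pi_k} \}\, V^{\pi}(s) \;-\; \mathbf{1}\{ s \notin X_f^{\pi_k} \}\, F^{\pi}(s) \Big]
\]

Under this reading, the first term maximizes the value function on feasible states while the second shrinks the infeasibility of the remaining states, so a single update can improve return and expand the feasible region at the same time.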
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15571