Keywords: Safety paradox, safe reinforcement learning, feasible policy iteration, feasibility function
TL;DR: Introduces an additional violation-seeking policy to overcome sparsity of unsafe samples.
Abstract: Achieving zero constraint violations in safe reinforcement learning poses a significant challenge. We discover a key obstacle, which we call the safety paradox: improving policy safety reduces the frequency of constraint-violating samples, thereby impairing feasibility function estimation and ultimately undermining policy safety. We theoretically prove that the estimation error bound of the feasibility function increases as the proportion of violating samples decreases. To overcome the safety paradox, we propose feasible dual policy iteration (FDPI), which employs an additional policy that strategically maximizes constraint violations while staying close to the original policy. Samples from both policies are combined for training, with the data distribution corrected by importance sampling. Extensive experiments on the Safety-Gymnasium benchmark show that FDPI achieves state-of-the-art performance, attaining the lowest constraint violations while delivering returns competitive with the best baselines.
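The abstract describes combining samples from the main policy and the violation-seeking policy, with the mixture corrected by importance sampling. The following is a minimal illustrative sketch of that correction step only; it assumes a simple per-sample likelihood-ratio reweighting and hypothetical names (`logp_pi`, `logp_mu`, `feasibility_target`), and is not the authors' implementation of FDPI.

```python
import numpy as np

def importance_weights(logp_pi, logp_mu, clip=10.0):
    """Per-sample ratio pi(a|s) / mu(a|s), clipped for numerical stability.

    pi is the original (safety-seeking) policy, mu the auxiliary
    violation-seeking policy whose samples must be reweighted.
    """
    return np.clip(np.exp(logp_pi - logp_mu), 0.0, clip)

def feasibility_target(costs, weights):
    """Weighted average of observed constraint costs: a stand-in for whatever
    regression target the feasibility-function estimator actually uses."""
    return np.sum(weights * costs) / np.sum(weights)

# Hypothetical usage: samples from pi carry weight 1, samples from mu carry
# their importance-sampling ratio, so violating samples from mu can inform the
# feasibility estimate without biasing it toward mu's distribution.
costs_pi = np.array([0.0, 0.0, 1.0])   # mostly-safe samples from pi
costs_mu = np.array([1.0, 1.0, 0.0])   # violation-rich samples from mu
w_mu = importance_weights(np.log([0.2, 0.1, 0.3]), np.log([0.5, 0.4, 0.3]))

target = feasibility_target(
    np.concatenate([costs_pi, costs_mu]),
    np.concatenate([np.ones_like(costs_pi), w_mu]),
)
print(target)
```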
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15600