Achieving $\tilde{O}(1)$ Strong Constraint Violation and Sublinear Strong Regret in Online CMDPs

18 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | License: CC BY 4.0
Keywords: CMDPs, Safe Reinforcement Learning, Strong Regret, Constant Constraint Violation
TL;DR: We propose FlexDOME, the first algorithm that provably achieves (1) near-constant $\tilde{O}(1)$ strong constraint violation, (2) sublinear $\tilde{O}(T^{7/8})$ strong reward regret, and (3) last-iterate convergence.
Abstract: We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics. Existing methods that achieve sublinear strong reward regret inevitably incur cumulative strong constraint violation that grows with the number of episodes $T$. To address this limitation, we propose Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME), the first algorithm that provably achieves near-constant $\tilde{O}(1)$ strong constraint violation together with sublinear $\tilde{O}(T^{7/8})$ strong reward regret. FlexDOME builds on the regularized primal-dual framework and adds a decaying safety margin to the constraint threshold. The margin tightens the feasible region to avoid constraint violation and decays at rate $\tilde{O}(t^{-1/8})$ to guarantee feasibility, offering a principled safety-performance trade-off. We then propose a policy-dual divergence potential function that helps establish a non-asymptotic last-iterate convergence guarantee. Experiments demonstrate that FlexDOME significantly enhances safety with negligible reward sacrifice, consistent with the theory.
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 11226
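
Illustrative sketch (not from the paper): the abstract describes a safety margin that tightens the constraint threshold and decays at rate $\tilde{O}(t^{-1/8})$. The Python snippet below is a minimal sketch of what such a schedule could look like. It makes several assumptions of its own: the constraint is written as a cost bounded above by a budget, the margin is exactly `scale * t**(-1/8)` with logarithmic factors dropped, and the names `safety_margin`, `tightened_threshold`, `cumulative_margin`, `scale`, and `budget` are all hypothetical. It also checks the elementary fact that $\sum_{t=1}^{T} t^{-1/8} \le \tfrac{8}{7} T^{7/8}$, which is one heuristic way to see why a margin of this order is compatible with a $T^{7/8}$-type bound; the paper's actual analysis may differ.

```python
def safety_margin(t: int, scale: float = 1.0) -> float:
    """Hypothetical margin eps_t = scale * t**(-1/8); log factors are ignored."""
    if t < 1:
        raise ValueError("episode index t must be >= 1")
    return scale * t ** (-1.0 / 8.0)


def tightened_threshold(budget: float, t: int, scale: float = 1.0) -> float:
    """Assumes a cost constraint V_c <= budget, tightened to budget - eps_t."""
    return budget - safety_margin(t, scale)


def cumulative_margin(T: int, scale: float = 1.0) -> float:
    """Sum of the margins over episodes 1..T."""
    return sum(safety_margin(t, scale) for t in range(1, T + 1))


if __name__ == "__main__":
    for T in (10, 1_000, 100_000):
        total = cumulative_margin(T)
        bound = (8.0 / 7.0) * T ** (7.0 / 8.0)  # integral upper bound
        print(f"T={T}: sum of margins={total:.2f}, (8/7)*T^(7/8)={bound:.2f}")
```

Under these assumptions the tightened threshold approaches the original budget as `t` grows, while the accumulated tightening over `T` episodes stays within a constant multiple of $T^{7/8}$.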