Optimal Regret and Hard Violation for Constrained Markov Decision Processes with Adversarial Losses and Constraints

TMLR Paper 5938 Authors

19 Sept 2025 (modified: 09 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: We investigate online learning in finite-horizon episodic Constrained Markov Decision Processes (CMDPs) under the most demanding setting: adversarial losses and constraints, bandit feedback, and unknown transitions. The most popular approaches, such as primal-dual and linear-programming methods, either rely on Slater's condition (yielding occasionally vacuous bounds) or require solving a complex optimization problem every round. Inspired by the groundbreaking work of~\citet{sinha2024optimal} in Constrained Online Convex Optimization (COCO), we map CMDP instances to corresponding COCO problems, which yields simple and elegant algorithms requiring only a single Euclidean projection per episode. Our first algorithm attains $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and $\widetilde{\mathcal{O}}(\sqrt{T})$ hard cumulative constraint violation under adversarial losses and constraints, unknown transition dynamics, and bandit feedback, without Slater's condition and without access to a strictly feasible policy. For known transitions, we achieve $\mathcal{O}(\sqrt{T})$ regret and $\widetilde{\mathcal{O}}(\sqrt{T})$ hard violation. Additionally, we study the remaining three combinations of known or unknown transitions and full or bandit feedback, again achieving optimal regret and hard-violation bounds in each case. Besides closing several gaps in the literature, our simple construction of biased estimators for the sub-gradient may be of independent interest for didactic purposes.
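To give a concrete feel for what "a single Euclidean projection per episode" means in a COCO-style update, here is a minimal, purely illustrative Python sketch. It is not the paper's algorithm: it uses a fixed penalty weight `lam`, a fixed step size `eta`, and projection onto the probability simplex as a stand-in for the occupancy-measure polytope of the CMDP; the names `coco_style_update`, `eta`, and `lam` are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al., 2008)."""
    n = len(v)
    u = np.sort(v)[::-1]                      # sort coordinates in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, n + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)      # threshold that enforces sum-to-one
    return np.maximum(v - theta, 0.0)

def coco_style_update(x, loss_grad, constraint_val, constraint_grad, eta=0.1, lam=1.0):
    """One round of a generic projected-descent COCO update (illustrative only).

    The surrogate gradient mixes the loss gradient with a penalty term that is
    active only when the constraint is violated, so the whole round costs a
    single Euclidean projection.
    """
    g = loss_grad + lam * max(constraint_val, 0.0) * constraint_grad
    return project_to_simplex(x - eta * g)

# Toy usage: three "actions", one linear constraint c(x) = a @ x - b <= 0.
x = np.full(3, 1.0 / 3.0)
loss_grad = np.array([0.2, -0.1, 0.3])
a, b = np.array([1.0, 0.0, 0.0]), 0.5
x = coco_style_update(x, loss_grad, a @ x - b, a)
print(x, x.sum())  # updated point, still on the simplex
```

The actual algorithms in the paper differ (e.g., adaptive weighting of the violation term and estimators tailored to bandit feedback and unknown transitions); this sketch only shows the per-round structure of one gradient step followed by one projection.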
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Reza_Babanezhad_Harikandeh1
Submission Number: 5938