An Optimistic Algorithm for Online CMDPs with Anytime Adversarial Constraints

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a new algorithm to handle adversarial constraints in CMDPs.
Abstract: Online safe reinforcement learning (RL) plays a key role in dynamic environments, with applications in autonomous driving, robotics, and cybersecurity. The objective is to learn optimal policies that maximize rewards while satisfying safety constraints modeled by constrained Markov decision processes (CMDPs). Existing methods achieve sublinear regret under stochastic constraints but often fail in adversarial settings, where constraints are unknown, time-varying, and potentially adversarially designed. In this paper, we propose the Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm, the first to address online CMDPs with anytime adversarial constraints. OMDPD achieves optimal regret $\tilde{\mathcal{O}}(\sqrt{K})$ and strong constraint violation $\tilde{\mathcal{O}}(\sqrt{K})$ without relying on Slater’s condition or the existence of a strictly known safe policy. We also show that access to accurate estimates of rewards and transitions can further improve these bounds. Our results offer practical guarantees for safe decision-making in adversarial environments.
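
To make the primal-dual idea behind the abstract concrete, the following is a minimal illustrative sketch of a generic optimistic mirror descent primal-dual update for a tabular CMDP. It is not the paper's OMDPD algorithm; all function names, parameters, and the crude optimistic Q-value stand-ins are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical sketch: one primal-dual step combining a mirror-descent (softmax)
# policy update on the Lagrangian with a projected dual update that prices
# constraint violation. Illustrative only; not the authors' OMDPD method.

def mirror_descent_primal_dual_step(policy_logits, q_reward, q_cost, dual_lambda,
                                    cost_budget, lr_primal=0.1, lr_dual=0.1):
    """policy_logits: (S, A) logits; q_reward/q_cost: (S, A) optimistic estimates;
    dual_lambda: nonnegative dual variable; cost_budget: per-episode cost threshold."""
    # Primal: exponentiated-gradient ascent on the Lagrangian reward - lambda * cost.
    lagrangian_q = q_reward - dual_lambda * q_cost
    policy_logits = policy_logits + lr_primal * lagrangian_q

    # Recover the policy via a per-state softmax over actions.
    policy = np.exp(policy_logits - policy_logits.max(axis=1, keepdims=True))
    policy /= policy.sum(axis=1, keepdims=True)

    # Dual: projected gradient ascent on the estimated constraint violation.
    expected_cost = (policy * q_cost).sum(axis=1).mean()  # crude surrogate
    dual_lambda = max(0.0, dual_lambda + lr_dual * (expected_cost - cost_budget))
    return policy_logits, policy, dual_lambda


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 3
    logits, lam = np.zeros((S, A)), 0.0
    for k in range(100):  # episodes
        q_r = rng.random((S, A))  # stand-in for optimistic reward estimates
        q_c = rng.random((S, A))  # stand-in for (possibly adversarial) cost estimates
        logits, pi, lam = mirror_descent_primal_dual_step(
            logits, q_r, q_c, lam, cost_budget=0.5)
    print("final dual variable:", round(lam, 3))
```

The sketch only shows the update structure (optimism in the value estimates, mirror descent in the primal, gradient ascent in the dual); the paper's guarantees additionally hinge on how the optimistic estimates and anytime adversarial constraints are handled, which is not reproduced here.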
Lay Summary: Modern AI systems—like those used in self-driving cars, robots, and cybersecurity—must learn to make good decisions while following safety rules. But in the real world, these safety rules can be unpredictable: they may change over time, be unclear, or even be set up in a way that tries to trick the system. This kind of hostile or adversarial behavior makes safe learning especially difficult. In this paper, we present a new learning algorithm that helps AI systems stay safe even when the safety constraints change in both adversarial and stochastic ways. Our method does not assume that the system already knows what is safe, and it does not require ideal conditions to work. Instead, it learns and adapts in challenging and uncertain environments. We show that this approach can learn effectively while also minimizing safety violations over time. This work provides stronger and more realistic guarantees for using AI safely in high-risk settings.
Primary Area: General Machine Learning->Online Learning, Active Learning and Bandits
Keywords: Constrained Reinforcement Learning; Constrained MDPs; Online Learning
Submission Number: 9434