Learning Adversarial MDPs with Stochastic Hard Constraints

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · License: CC BY 4.0
Abstract: We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We consider three scenarios. In the first, we address general CMDPs, designing an algorithm that attains sublinear regret and sublinear cumulative positive constraints violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability. In the last scenario, we only assume the existence of a strictly feasible policy, which is not known to the learner, and we design an algorithm attaining sublinear regret and constant cumulative positive constraints violation. Finally, we show that in the last two scenarios a dependence on Slater's parameter is unavoidable. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those existing algorithms can handle, enabling their adoption in a much wider range of applications.
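For context, here is a minimal formalization of the two performance measures named above, in generic notation that need not match the paper's own. Over episodes $t = 1, \dots, T$ with adversarial losses $\ell_t$, a stochastic constraint cost $g$, and a threshold $\alpha$, a learner playing policies $\pi_1, \dots, \pi_T$ from a policy set $\Pi$ incurs

$$R_T = \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t(\pi), \qquad V_T = \sum_{t=1}^{T} \bigl[\, g(\pi_t) - \alpha \,\bigr]^+,$$

where $[x]^+ := \max\{x, 0\}$, so only positive violations accumulate. Sublinear guarantees mean $R_T = o(T)$ and $V_T = o(T)$; the second scenario strengthens the violation guarantee to $g(\pi_t) \le \alpha$ at every episode with high probability, and the third to $V_T = O(1)$.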
Lay Summary: This work investigates online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We propose algorithms for three scenarios that achieve sublinear regret and control constraint violations under different assumptions about the feasibility of policies. This study is the first to address CMDPs with both adversarial losses and hard constraints, broadening the applicability of CMDPs to more complex and demanding environments.
Primary Area: Theory->Online Learning and Bandits
Keywords: CMDP, hard constraints, online learning
Submission Number: 11614