Policy Optimization in CMDPs with Bandit Feedback: Learning with Stochastic and Adversarial Constraints
Track: Research Track
Keywords: Policy Optimization, CMDPs
Abstract: We study online learning in \emph{constrained Markov decision processes} (CMDPs) in which rewards and constraints may be either \emph{stochastic} or \emph{adversarial}.
In such settings, Stradi et al. (2024) proposed the first \emph{best-of-both-worlds} algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases.
This algorithm suffers from two major drawbacks.
First, it only works under \emph{full feedback}, which severely limits its applicability in practice.
Second, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task.
In this paper, we provide the first \emph{best-of-both-worlds} algorithm for CMDPs with \emph{bandit feedback}.
Specifically, when the constraints are \emph{stochastic}, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are \emph{adversarial}, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
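To make the contrast with occupancy-measure-based methods concrete, the sketch below shows a generic primal-dual policy-optimization loop for an episodic tabular CMDP under bandit feedback. It is not the algorithm proposed in this paper: the toy CMDP instance, the REINFORCE-style Lagrangian update, and all step sizes are illustrative assumptions. The point is only that each episode performs a direct, closed-form policy update, rather than solving a convex program over occupancy measures.

# A minimal, generic sketch of primal-dual policy optimization for an episodic
# tabular CMDP with bandit feedback. This is NOT the paper's algorithm: the
# environment, step sizes, and the Lagrangian/REINFORCE update are illustrative
# assumptions used only to contrast a direct policy update with solving an
# occupancy-measure program each episode.
import numpy as np

rng = np.random.default_rng(0)

# --- Toy episodic CMDP (hypothetical instance) --------------------------------
S, A, H = 4, 3, 5                                # states, actions, horizon
P = rng.dirichlet(np.ones(S), size=(S, A))       # transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                     # reward means in [0, 1]
c = rng.uniform(size=(S, A))                     # cost means in [0, 1]
alpha = 0.5 * H                                  # per-episode cost budget

# --- Softmax policy over per-step logits ---------------------------------------
theta = np.zeros((H, S, A))                      # policy parameters

def policy(h, s):
    z = np.exp(theta[h, s] - theta[h, s].max())
    return z / z.sum()

eta_theta, eta_lam, lam, lam_max = 0.05, 0.05, 0.0, 10.0

for episode in range(2000):
    s = 0
    traj, ep_reward, ep_cost = [], 0.0, 0.0
    for h in range(H):
        p = policy(h, s)
        a = rng.choice(A, p=p)
        # Bandit feedback: only the visited (s, a) pair reveals noisy samples.
        rew = float(rng.random() < r[s, a])
        cost = float(rng.random() < c[s, a])
        traj.append((h, s, a))
        ep_reward += rew
        ep_cost += cost
        s = rng.choice(S, p=P[s, a])

    # Primal step: REINFORCE gradient of the Lagrangian  R - lam * (C - alpha).
    lagrangian = ep_reward - lam * (ep_cost - alpha)
    for h, s_h, a_h in traj:
        grad_log = -policy(h, s_h)
        grad_log[a_h] += 1.0                     # d log pi(a | s) / d theta
        theta[h, s_h] += eta_theta * lagrangian * grad_log

    # Dual step: projected ascent of lambda on the observed constraint violation.
    lam = float(np.clip(lam + eta_lam * (ep_cost - alpha), 0.0, lam_max))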
Submission Number: 5