TL;DR: This is the first result establishing sublinear regret and constraint-violation bounds for risk-sensitive constrained MDPs.
Abstract: We consider a setting in which the agent aims to maximize the expected cumulative reward, subject to a constraint that the entropic risk of the total utility exceeds a given threshold. Unlike the risk-neutral case, standard primal-dual approaches fail to directly yield regret and violation bounds, as value iteration with respect to a combined state-action value function is not applicable in the risk-sensitive setting. To address this, we adopt the Optimized Certainty Equivalent (OCE) representation of the entropic risk measure and reformulate the problem by augmenting the state space with a continuous budget variable. We then propose a primal-dual algorithm tailored to this augmented formulation. In contrast to the standard approach for risk-neutral CMDPs, our method incorporates a truncated dual update to account for the possible absence of strong duality. We show that the proposed algorithm achieves regret of $\tilde{\mathcal{O}}\big(V_{g,\max}K^{3/4} + \sqrt{H^4 S^2 A \log(1/\delta)}K^{3/4}\big)$ and constraint violation of $\tilde{\mathcal{O}}\big(V_{g,\max} \sqrt{ {H^3 S^2 A \log(1/\delta)}}K^{3/4} \big)$ with probability at least $1-\delta$, where $S$ and $A$ denote the cardinalities of the state and action spaces, respectively, $H$ is the episode length, $K$ is the number of episodes, $\alpha < 0$ is the risk-aversion parameter, and $V_{g,\max} = \frac{1}{|\alpha|}(\exp(|\alpha|H) - 1)$. *To the best of our knowledge, this is the first result establishing sublinear regret and violation bounds for the risk-sensitive CMDP problem.*
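Below is a minimal, self-contained sketch (not the paper's algorithm) of two ingredients described in the abstract: the OCE representation of the entropic risk, maximized over a scalar budget variable, and a truncated dual update that projects the multiplier onto a bounded interval rather than onto $[0,\infty)$. The sample utilities, grid, step size `eta`, truncation level `Lambda`, and constraint threshold are illustrative assumptions.

```python
import numpy as np

def entropic_risk(x, alpha):
    """Entropic risk of samples x: (1/alpha) * log E[exp(alpha * x)], with alpha < 0."""
    return np.log(np.mean(np.exp(alpha * x))) / alpha

def oce_objective(b, x, alpha):
    """OCE objective with exponential utility u(t) = (exp(alpha*t) - 1)/alpha,
    evaluated at budget b: b + E[u(x - b)]."""
    return b + np.mean(np.exp(alpha * (x - b)) - 1.0) / alpha

rng = np.random.default_rng(0)
alpha = -0.5                                   # risk-aversion parameter (alpha < 0)
utilities = rng.uniform(0.0, 5.0, size=100_000)  # stand-in for the cumulative utility

# The OCE, maximized over the budget variable b, recovers the entropic risk
# (up to grid resolution and sampling error).
grid = np.linspace(0.0, 5.0, 2001)
oce_value = max(oce_objective(b, utilities, alpha) for b in grid)
print(entropic_risk(utilities, alpha), oce_value)

# Truncated (projected) dual update: clip the multiplier to [0, Lambda] instead of
# [0, inf), since strong duality may fail in the risk-sensitive setting.
Lambda = 10.0       # hypothetical truncation level
eta = 0.01          # hypothetical dual step size
threshold = 2.0     # hypothetical constraint threshold on the entropic risk
lam = 0.0
estimated_risk = entropic_risk(utilities, alpha)  # plug-in estimate of the constraint value
lam = np.clip(lam + eta * (threshold - estimated_risk), 0.0, Lambda)
```

In the sketch, the budget variable `b` plays the role of the continuous component added to the augmented state, and the clipped multiplier update mirrors the truncation used to cope with the possible absence of strong duality.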
Lay Summary: In many practical sequential decision-making applications (e.g., finance, safe navigation), it is important to ensure that a risk-based measure of the cost stays below a given threshold. The standard CMDP setting can only handle the scenario where the expected cost is below a threshold. This is the first paper to obtain regret and violation bounds for an MDP with entropic risk constraints. We show that, unlike in the unconstrained setup, a Markovian policy may not be optimal here. Hence, we augment the state space and work with a constrained optimized certainty equivalent. To obtain the regret and violation bounds, we overcome challenges specific to the infinite augmented state space and the lack of strong duality caused by the non-linearity, unlike in the traditional CMDP setting. Several important questions remain open, such as whether the bounds can be improved.
Link To Code: https://github.com/mmoharami/Risk-Sensitive-CMDP
Primary Area: Reinforcement Learning
Keywords: Augmented State, Constrained MDP, Risk sensitive RL, Robust RL, Primal-Dual algorithm
Submission Number: 12898