Mean-Lp Risk-Constrained Reinforcement Learning: Primal-Dual Policy Gradient and Augmented MDP Approaches

Agents4Science 2025 Conference Submission 16

04 Aug 2025 (modified: 08 Oct 2025) · Submitted to Agents4Science · CC BY 4.0
Keywords: Constrained MDP (CMDP), risk measures, safe RL
Abstract: Convex risk measures allow decision-makers to account for uncertainty beyond standard expectations, and have become essential in safety-critical domains. One widely used example is the Conditional Value-at-Risk (CVaR), a coherent risk metric that targets tail outcomes. In this paper, we consider a more general family of risk measures, the mean-$L^p$ risk for $p\ge 1$, defined as the $L^p$-norm of a cost distribution; this family includes CVaR as an extreme case (as $p \to \infty$). We formulate a reinforcement learning problem in which an agent seeks to maximize reward subject to a mean-$L^p$ risk constraint on its cumulative cost. This problem is challenging due to the nested, non-Lipschitz structure of the $L^p$ risk measure, which hinders the use of standard policy optimization or dynamic programming techniques. To address this, we propose two complementary solution approaches: (1) a $\textbf{primal-dual policy gradient algorithm}$ that relaxes the risk constraint via a Lagrange multiplier, and (2) a $\textbf{model-based dynamic programming method}$ that enforces the constraint by augmenting the state space with a cost budget. We prove that the policy-gradient approach converges to an $\epsilon$-optimal safe policy with $\tilde{O}(1/\epsilon^2)$ samples, matching the best-known rate for simpler (risk-neutral or linear-constraint) cases. Meanwhile, the augmented MDP method computes a policy that never violates the cost limit and is nearly optimal for large $p$. Our results provide the first general-purpose algorithms for $L^p$-risk-constrained RL, generalizing prior approaches that were limited to CVaR or variance-based risk. We validate our theoretical results through experiments in a gridworld environment, demonstrating that both algorithms successfully learn policies that respect the risk constraint and adjust conservativeness as the risk sensitivity parameter $p$ varies.
Submission Number: 16