Planning and Learning in Average Risk-aware MDPs

Weikai Wang; Erick Delage

Published: 18 Sept 2025, Last Modified: 13 Dec 2025. Venue: NeurIPS 2025 poster. License: CC BY-NC 4.0

Keywords: average cost Markov decision processes, dynamic risk measure, relative value iteration, Q-learning, reinforcement learning

TL;DR: This work presents planning and learning algorithms for average-cost MDPs with dynamic risk measures.

Abstract: For continuing tasks, average cost Markov decision processes have well-documented value and can be solved using efficient algorithms. However, this framework explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms, namely a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, empirically confirm the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.
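To make the planning idea concrete, here is a minimal sketch of what a risk-aware relative value iteration could look like on a small finite MDP. The abstract does not state the update rule, so everything here is an assumption made for illustration: the entropic risk measure stands in for the one-step risk measure, and the names `entropic_risk`, `risk_aware_rvi`, `beta`, and `ref_state` are hypothetical. It is not the authors' algorithm, and it does not check the conditions under which such an iteration converges.

```python
import numpy as np


def entropic_risk(values, probs, beta):
    """Entropic risk (1/beta) * log E[exp(beta * X)] of a discrete cost X.

    Assumption: used here only as one example of a one-step risk measure.
    beta = 0 is treated as the risk-neutral expectation.
    """
    if beta == 0.0:
        return float(probs @ values)
    z = beta * values
    m = z.max()  # shift the exponent for numerical stability
    return float((m + np.log(probs @ np.exp(z - m))) / beta)


def risk_aware_rvi(cost, P, beta, ref_state=0, tol=1e-10, max_iter=10_000):
    """Illustrative relative value iteration with a one-step risk measure.

    cost: array of shape (S, A) with one-step costs.
    P:    array of shape (S, A, S) with transition probabilities.
    Returns (gain estimate, relative value function h, greedy policy).
    """
    n_states, n_actions = cost.shape
    h = np.zeros(n_states)
    for _ in range(max_iter):
        # Risk-aware Bellman backup: cost plus the risk of the next-state value.
        q = np.array([[cost[s, a] + entropic_risk(h, P[s, a], beta)
                       for a in range(n_actions)]
                      for s in range(n_states)])
        Th = q.min(axis=1)
        gain = Th[ref_state]   # offset at the reference state
        h_new = Th - gain      # keep the iterates bounded
        converged = np.max(np.abs(h_new - h)) < tol
        h = h_new
        if converged:
            break
    return gain, h, q.argmin(axis=1)
```

With `beta = 0` the sketch reduces to the standard risk-neutral RVI; a positive `beta` penalizes variability in the next-state value, so the resulting gain estimate is never below the risk-neutral one.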

Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)

Submission Number: 3471
