Predictive CVaR Q-learning

ICLR 2026 Conference Submission 16775 Authors

Published: 19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Visible to Everyone · License: CC BY 4.0
Keywords: CVaR optimization, Risk-sensitive RL, Q-learning, Bellman equation, Policy improvement
TL;DR: We introduce a new Bellman equation tailored for the CVaR objective, and develop an efficient Q-learning algorithm accompanied by a policy improvement theorem.
Abstract: We propose a sample-efficient Q-learning algorithm for reinforcement learning with the Conditional Value-at-Risk (CVaR) objective. Our algorithm is built upon the predictive tail value function, a novel formulation of the risk-sensitive action value that admits a recursive structure analogous to the conventional risk-neutral Bellman equation. This structure enables the Q-learning algorithm to utilize the entire set of sample trajectories rather than relying only on worst-case outcomes, enhancing sample efficiency. We further derive a Bellman optimality equation and a policy improvement theorem, which provide the theoretical foundations of our algorithm and remedy inconsistencies that have existed in the literature. Empirical results demonstrate that our method consistently improves CVaR performance while maintaining stable and interpretable learning dynamics.
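For context, a minimal sketch of the CVaR objective the abstract refers to: the empirical CVaR at level alpha is the mean of the worst alpha-fraction of sampled returns. This is the standard definition of the risk measure, not the paper's predictive tail value function or its Q-learning update; the function name and the alpha value below are illustrative.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Empirical CVaR_alpha of returns: mean of the worst alpha-fraction.

    Standard risk measure used for illustration only; this is not the
    paper's predictive tail value function. `alpha` is a hypothetical
    confidence level.
    """
    returns = np.sort(np.asarray(returns, dtype=float))  # ascending: worst outcomes first
    k = max(1, int(np.ceil(alpha * len(returns))))       # size of the alpha-tail
    return returns[:k].mean()

# Example: CVaR_0.1 of simulated episode returns
rng = np.random.default_rng(0)
sample_returns = rng.normal(loc=1.0, scale=2.0, size=1000)
print(empirical_cvar(sample_returns, alpha=0.1))
```

A naive CVaR estimator like this uses only the worst alpha-fraction of trajectories; the abstract's point is that their recursive formulation lets the learner exploit all sampled trajectories instead.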
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 16775