Optimality of Stationary Policies in Risk-averse Total-reward MDPs with EVaR

Published: 17 Jun 2024, Last Modified: 27 Jun 2024
Venue: FoRLaC Poster
License: CC BY 4.0
Abstract: The risk-neutral discounted objective is popular in reinforcement learning, in part due to the existence of stationary optimal policies and convenient analysis based on contracting Bellman operators. Unfortunately, for some common risk-averse discounted objectives, such as Value at Risk (VaR) and Conditional Value at Risk (CVaR), optimal policies must be history-dependent and must be computed using complex state augmentation. In this paper, we show that the risk-averse total-reward objective, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR), can be optimized by a stationary policy, an important property for practical implementations. In addition, an optimal policy can be computed efficiently using value iteration, policy iteration, and even linear programming. Importantly, our results only require the relatively mild condition of transient MDPs and allow for both positive and negative rewards, unlike prior work that requires assumptions on the sign of the rewards. Overall, our results suggest that the total-reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning problems.
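As a rough illustration of the kind of computation the abstract alludes to, the sketch below runs an ERM-style value iteration on a small hand-made transient MDP and then maximizes over the risk parameter to approximate an EVaR objective. It assumes one common reward-based convention, ERM_beta[X] = -(1/beta) log E[exp(-beta X)] and EVaR_alpha[X] = sup_{beta>0} { ERM_beta[X] + log(alpha)/beta }; the toy MDP data and all names (`P`, `R`, `erm_value_iteration`, `evar_objective`) are illustrative assumptions, not the paper's actual algorithm or notation.

```python
import numpy as np

# Toy transient MDP: states 0 and 1 are non-terminal, state 2 is absorbing
# with zero reward. P[a, s, s'] are transition probabilities, R[s, a] rewards.
# All numbers here are made-up illustrative data, not from the paper.
n_states, n_actions = 3, 2
P = np.array([
    [[0.6, 0.2, 0.2],   # action 0: riskier, tends to stay longer
     [0.1, 0.5, 0.4],
     [0.0, 0.0, 1.0]],
    [[0.2, 0.1, 0.7],   # action 1: safer, exits sooner
     [0.1, 0.2, 0.7],
     [0.0, 0.0, 1.0]],
])
R = np.array([[1.0, 0.4],
              [2.0, 0.5],
              [0.0, 0.0]])

def erm_bellman(v, beta):
    """One ERM Bellman backup: r(s,a) - (1/beta) log E[exp(-beta v(S')) | s, a]."""
    exp_next = np.einsum('asn,n->sa', P, np.exp(-beta * v))
    q = R - np.log(exp_next) / beta
    return q.max(axis=1), q.argmax(axis=1)

def erm_value_iteration(beta, n_iter=1000, tol=1e-10):
    """Iterate the ERM Bellman operator to a fixed point; returns values and a greedy stationary policy."""
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(n_iter):
        v_new, policy = erm_bellman(v, beta)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, policy
        v = v_new
    return v, policy

def evar_objective(alpha, betas=np.geomspace(1e-3, 10.0, 200)):
    """Grid-search approximation of sup_beta { ERM_beta(return) + log(alpha)/beta } at state 0."""
    best_val, best_policy = -np.inf, None
    for beta in betas:
        v, policy = erm_value_iteration(beta)
        val = v[0] + np.log(alpha) / beta
        if val > best_val:
            best_val, best_policy = val, policy
    return best_val, best_policy

if __name__ == "__main__":
    val, policy = evar_objective(alpha=0.2)
    print("approx. EVaR_0.2 of the total reward from state 0:", round(val, 4))
    print("greedy stationary policy per state:", policy)
```

Note that the candidate policy returned for each beta is stationary (one action per state), consistent with the abstract's claim; the crude grid search over beta stands in for whatever optimization over the risk parameter the paper actually uses.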
Format: Long format (up to 8 pages + refs, appendix)
Publication Status: No
Submission Number: 10