The infinite-horizon discounted objective is popular in reinforcement learning, partly because it admits stationary optimal policies and a convenient analysis based on contracting Bellman operators. Unfortunately, for most common risk-averse discounted objectives, such as Value at Risk (VaR) and Conditional Value at Risk (CVaR), optimal policies must be history-dependent and must be computed using complex state-augmentation schemes. In this paper, we show that the total reward objective, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR), can be optimized by a stationary policy, an essential property for practical implementations. Moreover, an optimal policy can be computed efficiently using linear programming. Importantly, our results require only the relatively mild condition of transient MDPs and allow for both positive and negative rewards, unlike prior work that assumes a fixed sign for the rewards. Our results suggest that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning problems.
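To make the risk measure concrete, the sketch below evaluates the standard definition of the Entropic Risk Measure, ERM_β(X) = -(1/β) log E[exp(-β X)], on a small discrete reward distribution. The function name and the example lottery are illustrative choices, not taken from the paper; the formula itself is the standard ERM definition.

```python
import numpy as np

def erm(rewards, probs, beta):
    """Entropic risk measure: ERM_beta(X) = -(1/beta) * log E[exp(-beta * X)].

    Risk-averse for beta > 0; approaches the expectation E[X] as beta -> 0,
    and approaches the worst-case outcome min(X) as beta -> infinity.
    """
    rewards = np.asarray(rewards, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return -np.log(np.dot(probs, np.exp(-beta * rewards))) / beta

# Illustrative lottery: gain 10 or lose 10 with equal probability (mean 0).
rewards = [10.0, -10.0]
probs = [0.5, 0.5]

print(erm(rewards, probs, 1e-6))  # close to the mean, 0
print(erm(rewards, probs, 0.5))   # strictly below the mean: risk aversion
```

Larger β penalizes the spread of outcomes more heavily, which is the sense in which optimizing ERM (and EVaR, defined via a supremum over ERM) yields risk-averse policies.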
Keywords: MDP, EVaR, Stationary policy, Total reward
Submission Number: 25