Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Published: 17 Jul 2025, Last Modified: 06 Sept 2025 · EWRL 2025 Poster · CC BY 4.0
Keywords: Discounted MDPs; Risk-sensitive reinforcement learning; Sample complexity; Lower bound; Generative Model
TL;DR: We present novel upper and lower sample-complexity bounds for risk-sensitive RL under entropic risk measures.
Abstract: In this paper, we analyze the sample complexities of learning the optimal state-action value function $ Q^* $ and an optimal policy $ \pi^* $ in a discounted Markov decision process (MDP) where the agent has recursive entropic risk preferences with risk parameter $ \beta\neq 0 $ and where a generative model of the MDP is available. We provide and analyze a simple model-based approach, which we call model-based risk-sensitive $Q$-value iteration (MB-RS-QVI), which leads to $ (\epsilon,\delta) $-PAC-bounds on $ \|Q^* - Q_k\| $ and $ \|V^*-V^{\pi_k}\| $, where $ Q_k $ is the output of MB-RS-QVI after $k$ iterations and $ \pi_k $ is the greedy policy with respect to $ Q_k $. Both PAC-bounds have exponential dependence on the effective horizon $ \frac{1}{1-\gamma} $, and the strength of this dependence grows with the learner's risk-sensitivity $ |\beta| $. We also provide two lower bounds which show that exponential dependence on $ |\beta|\frac{1}{1-\gamma} $ is unavoidable in both cases. The lower bounds reveal that both PAC-bounds are tight in $ \epsilon $ and $ \delta $, that the PAC-bound on learning $Q^*$ is tight in the number of actions $A$, and that the PAC-bound on policy learning is nearly tight in $A$.
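To make the algorithm concrete, the following is a minimal sketch of tabular MB-RS-QVI with a generative model, not the authors' implementation. The generative-model interface `sample(s, a, n)`, the known reward table `r`, and the exact placement of the discount factor inside the entropic Bellman backup are assumptions for illustration and may differ from the paper's definitions.

```python
import numpy as np

def mb_rs_qvi(sample, S, A, r, gamma, beta, n_samples, k_iters):
    """Sketch of model-based risk-sensitive Q-value iteration (MB-RS-QVI).

    Assumed interface (hypothetical): sample(s, a, n) draws n i.i.d. next
    states from the generative model for state-action pair (s, a), and
    r is a known (S, A) reward table. beta != 0 is the risk parameter.
    """
    # Step 1: build an empirical transition model from generative-model samples.
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for s_next in sample(s, a, n_samples):
                P_hat[s, a, s_next] += 1.0 / n_samples

    # Step 2: run k iterations of the entropic-risk Bellman backup
    #   Q(s,a) = r(s,a) + (gamma / beta) * log E_{s'~P_hat(.|s,a)}[exp(beta * max_a' Q(s',a'))]
    # (one common form; a numerically stable version would use a log-sum-exp trick).
    Q = np.zeros((S, A))
    for _ in range(k_iters):
        V = Q.max(axis=1)                  # greedy next-state values
        exp_v = np.exp(beta * V)           # exponential utility of next-state values
        Q = r + (gamma / beta) * np.log(P_hat @ exp_v)

    pi_k = Q.argmax(axis=1)                # greedy policy with respect to Q_k
    return Q, pi_k
```

As $\beta \to 0$ this backup recovers the standard risk-neutral Bellman update, while $\beta < 0$ yields risk-averse and $\beta > 0$ risk-seeking behavior.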
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Oliver_Mortensen1
Track: Regular Track: unpublished work
Submission Number: 52