Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model
TL;DR: This paper studies the sample complexity of the Iterated CVaR RL problem given access to a generative model, establishing nearly matching upper and lower bounds.
Abstract: In this work, we study the sample complexity of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at each step, an objective named Iterated CVaR.
We develop nearly matching upper and lower bounds on the sample complexity for this problem. Specifically, we first prove that a value iteration-based algorithm, ICVaR-VI, achieves an $\epsilon$-optimal policy with at most $\tilde{\mathcal{O}}\left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2}\right)$ samples, where $\gamma$ is the discount factor and $S, A$ are the sizes of the state and action spaces. Furthermore, if $\tau \geq \gamma$, the sample complexity improves to $\tilde{\mathcal{O}}\left(\frac{SA}{(1-\gamma)^3\epsilon^2}\right)$.
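To make the Bellman backup behind this objective concrete, the following is a minimal sketch of iterated-CVaR value iteration on a known tabular model; it is not the paper's exact ICVaR-VI (which runs on an empirical model $\widehat{P}$ estimated from generative-model samples), and the function and variable names (`cvar`, `icvar_vi`, `P`, `R`) are our own illustrative choices.

```python
import numpy as np

def cvar(values, probs, tau):
    """Lower-tail CVaR at level tau of a discrete distribution:
    the average of the worst tau probability mass of outcomes."""
    order = np.argsort(values)            # ascending: worst outcomes first
    total, mass = 0.0, 0.0
    for v, p in zip(values[order], probs[order]):
        take = min(p, tau - mass)         # clip the last atom at the tau boundary
        total += take * v
        mass += take
        if mass >= tau:
            break
    return total / tau

def icvar_vi(P, R, gamma, tau, iters=500):
    """Value iteration with the iterated-CVaR backup:
    V(s) = max_a [ R(s, a) + gamma * CVaR_tau(V(s')) ],  s' ~ P(.|s, a).
    P: (S, A, S) transition kernel, R: (S, A) reward table."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = np.array([[R[s, a] + gamma * cvar(V, P[s, a], tau)
                       for a in range(A)] for s in range(S)])
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)            # values and a greedy policy
```

Note that with $\tau = 1$ the CVaR backup reduces to the ordinary expectation, recovering standard value iteration.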
We also establish a minimax lower bound of $\tilde{\Omega}\left(\frac{(1-\gamma\tau)SA}{(1-\gamma)^4\tau\epsilon^2}\right)$.
For any constant risk level $0<\tau\leq 1$, our upper and lower bounds match up to logarithmic factors, demonstrating the tightness of our analysis.
We also investigate the limiting case as the risk level $\tau$ tends to zero, called Worst-Path RL, where the objective is to maximize the minimum possible cumulative reward. For this problem, we develop matching upper and lower bounds of $\tilde{\Theta}\left(\frac{SA}{p_{\min}}\right)$, where $p_{\min}$ denotes the minimum non-zero reaching probability of the transition kernel.
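In this $\tau \to 0$ regime the CVaR backup collapses to the minimum of $V$ over next states reachable with non-zero probability, which is where $p_{\min}$ enters. A hedged sketch in the same illustrative style as above (the name `worst_path_vi` is ours):

```python
import numpy as np

def worst_path_vi(P, R, gamma, iters=500):
    """Value iteration for the Worst-Path (tau -> 0) limit:
    V(s) = max_a [ R(s, a) + gamma * min over {s': P(s'|s,a) > 0} of V(s') ]."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = np.array([[R[s, a] + gamma * V[P[s, a] > 0].min()
                       for a in range(A)] for s in range(S)])
        V = Q.max(axis=1)
    return V
```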
Submission Number: 1474