TL;DR: Provable algorithms for risk-sensitive RL with static OCE that reduce to standard RL oracles in an augmented MDP.
Abstract: We study risk-sensitive RL, where the goal is to learn a history-dependent policy that optimizes a risk measure of the cumulative reward.
We consider the family of optimized certainty equivalent (OCE) risk measures, which captures important cases such as conditional value-at-risk (CVaR), entropic risk, and Markowitz's mean-variance. In this setting, we propose two meta-algorithms: one grounded in optimism and another based on policy gradients, both of which can leverage the broad suite of risk-neutral RL algorithms in an augmented Markov Decision Process (MDP). Via this reductions approach, we transfer theory for risk-neutral RL to establish novel OCE bounds in complex, rich-observation MDPs. For the optimism-based algorithm, we prove bounds that generalize prior results in CVaR RL and that provide the first risk-sensitive bounds for exogenous block MDPs. For the gradient-based algorithm, we establish both monotone improvement and global convergence guarantees under a discrete reward assumption. Finally, we empirically show that our algorithms learn the optimal history-dependent policy in a proof-of-concept MDP where all Markovian policies provably fail.
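For readers unfamiliar with the OCE, the following minimal sketch (not code from the paper) illustrates the static risk measure numerically, assuming the standard variational definition OCE_u(X) = sup_λ {λ + E[u(X − λ)]} for a concave, nondecreasing utility u with u(0) = 0; the utility choices for CVaR and mean-variance are standard illustrative instances.

```python
# Illustrative sketch (not from the paper): estimating a static OCE from samples
# via the variational form OCE_u(X) = sup_lambda { lambda + E[u(X - lambda)] }.
import numpy as np

def oce(samples, u, lambdas):
    """Approximate OCE_u(X) by maximizing over a grid of candidate lambdas."""
    samples = np.asarray(samples, dtype=float)
    return max(lam + np.mean(u(samples - lam)) for lam in lambdas)

# Assumed utility choices recovering familiar risk measures:
#   CVaR at level alpha:    u(t) = -(1/alpha) * max(-t, 0)
#   Mean-variance (coef c): u(t) = t - c * t**2
def cvar_utility(t, alpha=0.1):
    return -np.maximum(-t, 0.0) / alpha

def mean_variance_utility(t, c=0.5):
    return t - c * t ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    returns = rng.normal(loc=1.0, scale=1.0, size=100_000)  # cumulative rewards
    grid = np.linspace(returns.min(), returns.max(), 400)
    print("CVaR_0.1 OCE      :", oce(returns, cvar_utility, grid))
    print("Mean-variance OCE :", oce(returns, mean_variance_utility, grid))
```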
Lay Summary: In high-stakes settings (e.g., healthcare, finance, systems), we often care not only about the average outcome but also about avoiding bad or tail outcomes and reducing variance. Our paper proposes a framework for solving such risk-sensitive applications via reinforcement learning with the optimized certainty equivalent, a broad class of risk measures that captures important cases such as Conditional Value-at-Risk (CVaR) and mean-variance. We reduce the challenging risk-sensitive RL problem to a standard RL problem, enabling the use of many existing algorithms from the literature. By combining our reduction with risk-neutral RL methods, we derive strong theoretical guarantees even in tasks with high-dimensional state spaces, such as exogenous block MDPs. In sum, our work shows that practical, risk-sensitive objectives can be addressed using well-established RL techniques through a principled reduction framework.
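To make the reduction idea concrete, here is a minimal sketch of one common way to turn a static OCE objective with a fixed dual variable λ into a risk-neutral one: track the cumulative reward in the state and pay λ + u(C − λ) only at episode end. This is an assumed construction for illustration, not necessarily the paper's exact augmentation; the environment interface (reset/step) is hypothetical.

```python
# Illustrative sketch (assumed construction): state augmentation that lets a
# risk-neutral RL algorithm optimize lam + E[u(C - lam)] for a fixed lam.

class OCEAugmentedEnv:
    """Wraps an episodic env exposing reset() -> obs and step(a) -> (obs, r, done)."""

    def __init__(self, env, u, lam):
        self.env, self.u, self.lam = env, u, lam
        self.cum_reward = 0.0

    def reset(self):
        self.cum_reward = 0.0
        obs = self.env.reset()
        return (obs, self.cum_reward)  # augmented observation

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.cum_reward += reward
        if done:
            # Terminal payoff lam + u(C - lam); intermediate rewards are zero,
            # so the risk-neutral return in the wrapped env equals the OCE
            # objective at this fixed lam.
            aug_reward = self.lam + self.u(self.cum_reward - self.lam)
        else:
            aug_reward = 0.0
        return (obs, self.cum_reward), aug_reward, done
```

An outer search over λ (or folding λ into the augmented state) would then recover the full OCE objective, with any off-the-shelf risk-neutral RL method run on the wrapped environment.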
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Risk-Sensitive RL, Reduction to RL, OCE, CVaR, Variance
Submission Number: 6401