Keywords: offline bandit learning, sample complexity, conditional value at risk, risk sensitive learning, reinforcement learning
Abstract: The conventional offline bandit policy learning literature aims to find a policy that performs well in terms of the average policy effect (APE) on the population, i.e. the **social welfare**. However, in many settings, including healthcare and public policy, the decision-maker is also concerned about the **risk** of implementing a certain policy. The policy that maximizes social welfare could still carry a risk of negative effects on some fraction of the worst-affected population, and is therefore not necessarily the ideal policy. In this paper, we investigate risk-sensitive offline policy learning and its sample complexity, with the conditional value at risk (CVaR) of the covariate-conditional average policy effect (CAPE) as the risk measure.
To this end, we first provide a doubly-robust estimator for the CVaR of the CAPE, and show that this estimator enjoys asymptotic normality even when the nuisance parameters are estimated at a slower-than-$n^{-\frac{1}{2}}$ rate ($n$ being the sample size).
We then propose a risk-sensitive learning algorithm that finds, within a given policy class $\Pi$, the policy maximizing a weighted sum of the APE and the CVaR of the CAPE.
We show that the sample complexity of the proposed algorithm is of the order $O(\kappa(\Pi)n^{-\frac{1}{2}})$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance. The proposed methods are evaluated empirically, demonstrating that, at the cost of only a small reduction in social welfare, our methodology improves the outcomes of the worst-affected population.
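For concreteness, a minimal sketch of the objective described above, written with assumed notation not taken from the submission ($\tau_\pi(X)$ for the CAPE under policy $\pi$, $\alpha$ for the CVaR level, $\lambda \in [0,1]$ for the risk weight):
$$
\max_{\pi \in \Pi}\; (1-\lambda)\,\mathbb{E}\!\left[\tau_\pi(X)\right] \;+\; \lambda\,\mathrm{CVaR}_\alpha\!\left(\tau_\pi(X)\right),
\qquad
\mathrm{CVaR}_\alpha(Z) \;=\; \max_{c\in\mathbb{R}}\left\{ c - \tfrac{1}{\alpha}\,\mathbb{E}\!\left[(c - Z)_{+}\right]\right\},
$$
where $\mathbb{E}[\tau_\pi(X)]$ is the APE and the lower-tail CVaR term (in the standard Rockafellar–Uryasev form) captures the average effect on the worst-affected $\alpha$-fraction of the population.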
Primary Area: causal reasoning
Submission Number: 21599