Keywords: offline bandit learning, sample complexity, conditional value at risk, risk sensitive learning, reinforcement learning
Abstract: The conventional offline bandit policy learning literature aims to find a policy that performs well in terms of the average policy effect (APE) on the population, i.e. the **social welfare**. However, in many settings, including healthcare and public policy, the decision-maker is also concerned about the **risk** of implementing a certain policy. The policy that maximizes social welfare could still carry a risk of negative effects on some fraction of the worst-affected population, and is therefore not necessarily the ideal policy. In this paper, we investigate risk-sensitive offline policy learning and its sample complexity, with the conditional value at risk (CVaR) of the covariate-conditional average policy effect (CAPE) as the risk measure.
To this end, we first provide a doubly-robust estimator for the CVaR of the CAPE, and show that this estimator enjoys asymptotic normality even when the nuisance parameters are estimated at a slower-than-$n^{-\frac{1}{2}}$ rate ($n$ being the sample size).
We then propose a risk-sensitive learning algorithm that finds, within a given policy class $\Pi$, the policy maximizing a weighted sum of the APE and the CVaR of the CAPE.
We show that the sample complexity of the proposed algorithm is of the order $O(\kappa(\Pi)n^{-\frac{1}{2}})$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance. The proposed methods are evaluated empirically, demonstrating that, at the cost of only a small reduction in social welfare, our methodology improves the outcomes of the worst-affected population.
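For concreteness, a minimal sketch of the objective described above, written with assumed notation not taken from the submission ($\tau_\pi(X)$ for the CAPE under policy $\pi$, $\alpha$ for the CVaR level, $\lambda \in [0,1]$ for the risk weight):
$$
\max_{\pi \in \Pi}\; (1-\lambda)\,\mathbb{E}\!\left[\tau_\pi(X)\right] \;+\; \lambda\,\mathrm{CVaR}_\alpha\!\left(\tau_\pi(X)\right),
\qquad
\mathrm{CVaR}_\alpha(Z) \;=\; \max_{c\in\mathbb{R}}\left\{ c - \tfrac{1}{\alpha}\,\mathbb{E}\!\left[(c - Z)_{+}\right]\right\},
$$
where $\mathbb{E}[\tau_\pi(X)]$ is the APE and the lower-tail CVaR term (in the standard Rockafellar–Uryasev form) captures the average effect on the worst-affected $\alpha$-fraction of the population.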
Primary Area: causal reasoning
Submission Number: 21599