Policy Learning with Abstention
TL;DR: A two-stage abstention-based policy learner that defers on uncertain cases, achieves fast O(1/n) offline regret (including a DR version for unknown propensities), and enables safer, more robust policy improvement.
Abstract: Policy learning algorithms are regularly used in domains such as personalized medicine and advertising to develop individualized treatment regimes. One shortcoming of existing policy learning algorithms is that they do not adjust their decisions to the uncertainty in their predictions; that is, they fail to \textit{abstain}. To remedy this, we introduce a framework for \textit{policy learning with abstention}, in which a policy that chooses not to assign a treatment to some individuals (e.g., customers or patients) receives a small additive reward on top of the value of a random guess. Building on empirical welfare maximization, we propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention class from the disagreements among those policies. We establish fast $O(1/n)$-type regret guarantees for the learned policy from offline data when the treatment propensity in the offline data is known, and we show how to extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. Furthermore, we use our algorithm as a black box to obtain improved guarantees under margin conditions that go beyond realizability, which has been a standard assumption in prior work on policy learning with a margin. We also study connections to distributionally robust policy learning, where abstention acts as a hedge against small distribution shifts, and to safe policy improvement, where the objective is to improve upon a given baseline policy with high probability. We validate our theoretical findings through extensive synthetic experiments.
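To make the construction concrete, the sketch below illustrates one way the two-stage procedure described in the abstract could be organized; it is our own illustration under stated assumptions, not the paper's implementation. All names (`dr_scores`, `two_stage_abstention_learner`, `mu_hat`, `e_hat`, `tol`, the toy threshold policy class) are hypothetical. Stage 1 retains every policy whose estimated welfare is within a tolerance of the empirical maximum; Stage 2 abstains on any input where the retained policies disagree; an abstained unit would then, per the abstract, receive the value of a random guess plus a small additive bonus. The estimated welfare here uses the standard doubly robust score $\widehat{\mu}(X_i,a) + \mathbf{1}\{A_i=a\}\,(Y_i-\widehat{\mu}(X_i,a))/\widehat{e}(X_i,a)$, which covers both the known- and unknown-propensity cases.

```python
# Hypothetical sketch of a two-stage abstention learner in the spirit of the
# abstract. Stage 1: keep all policies whose estimated welfare is within `tol`
# of the best. Stage 2: abstain wherever the retained policies disagree.
import numpy as np

def dr_scores(X, A, Y, mu_hat, e_hat, n_actions):
    """Standard DR score: Gamma_i(a) = mu_hat(x_i,a)
       + 1{A_i=a} * (Y_i - mu_hat(x_i,a)) / e_hat(x_i,a)."""
    scores = np.zeros((len(Y), n_actions))
    for a in range(n_actions):
        mu_a = mu_hat(X, a)                          # outcome-model prediction
        correction = (A == a) * (Y - mu_a) / e_hat(X, a)  # propensity correction
        scores[:, a] = mu_a + correction
    return scores

def empirical_welfare(policy, X, scores):
    """Average DR score of the action chosen by `policy` on each unit."""
    actions = policy(X)
    return scores[np.arange(len(actions)), actions].mean()

def two_stage_abstention_learner(policy_class, X, scores, tol):
    """Stage 1: retain near-optimal policies; Stage 2: abstain on disagreement."""
    welfare = [empirical_welfare(pi, X, scores) for pi in policy_class]
    best = max(welfare)
    near_optimal = [pi for pi, w in zip(policy_class, welfare) if w >= best - tol]
    pi_hat = policy_class[int(np.argmax(welfare))]   # empirically best policy

    def policy_with_abstention(x_new):
        choices = np.stack([pi(x_new) for pi in near_optimal])   # (k, m)
        disagree = (choices != choices[0]).any(axis=0)
        out = pi_hat(x_new).astype(float)
        out[disagree] = np.nan    # NaN encodes "abstain" (defer the decision)
        return out

    return policy_with_abstention

# Toy usage with a class of threshold policies pi_t(x) = 1{x > t}:
rng = np.random.default_rng(0)
X = rng.normal(size=200); A = rng.integers(0, 2, size=200)
Y = (X > 0).astype(float) * (A == 1) + rng.normal(scale=0.1, size=200)
mu_hat = lambda x, a: np.zeros_like(x) + 0.25 * a   # crude outcome model
e_hat = lambda x, a: np.full_like(x, 0.5)           # (estimated) propensity
scores = dr_scores(X, A, Y, mu_hat, e_hat, n_actions=2)
policy_class = [(lambda t: (lambda x: (x > t).astype(int)))(t)
                for t in np.linspace(-1, 1, 21)]
pi = two_stage_abstention_learner(policy_class, X, scores, tol=0.02)
print(pi(np.array([-2.0, 0.05, 2.0])))  # NaN where near-optimal policies disagree
```

In this sketch the abstention class is exactly the disagreement region of the near-optimal set, so widening the tolerance trades off coverage (fewer abstentions) against the confidence with which treated units follow a genuinely near-optimal rule.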
Submission Number: 1968