TL;DR: A two-stage abstention-based policy learner that defers on uncertain cases, achieves fast O(1/n) offline regret (including a DR version for unknown propensities), and enables safer, more robust policy improvement.
Abstract: Policy learning algorithms are regularly leveraged in domains such as personalized medicine and advertising to develop individualized treatment regimes. However, a critical deficit of existing algorithms is that they force a decision even when predictions are uncertain, a risky approach in high-stakes settings. The ability to abstain, that is, to defer to a safe default or an expert, is crucial but largely unexplored in this context.
To remedy this, we introduce a framework for policy learning with abstention, in which policies that choose not to assign a treatment to some customers/patients receive a small, additive reward on top of the value of a random guess. We propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention class based on disagreements between the policies. We establish fast $O(1/n)$-type regret guarantees for the abstaining policy when propensities are known, and show how to extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. Furthermore, we demonstrate that our abstention framework is a versatile tool with direct applications to several other core problems in policy learning. We use our algorithm as a black box to obtain improved guarantees under margin conditions without the common realizability assumption. We also show that abstention provides a natural connection to both distributionally robust policy learning, where it acts as a hedge against small data shifts, and safe policy improvement, where the goal is to improve upon a baseline policy with high probability.
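The two-stage idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a finite policy class, binary treatments, known propensities, and a plain inverse-propensity-weighted (IPW) value estimate. The names `ipw_values`, `near_optimal_set`, `abstaining_policy`, the `ABSTAIN` marker, the tolerance `eps`, and the synthetic data are all hypothetical choices made for illustration. The abstention reward (random-guess value plus a small bonus) is not modeled here; the sketch only shows how disagreement among near-optimal policies yields an abstention region.

```python
import numpy as np

ABSTAIN = -1  # hypothetical marker meaning "defer to a safe default or expert"


def ipw_values(policies, X, A, Y, propensity):
    """IPW estimate of each policy's value from logged data.

    propensity[i] is the (assumed known) probability with which the
    logging policy chose the observed action A[i] given X[i].
    """
    weights = Y / propensity
    return np.array([np.mean((pi(X) == A) * weights) for pi in policies])


def near_optimal_set(values, eps):
    """Stage 1: indices of policies whose estimated value is within eps of the best."""
    return np.flatnonzero(values >= values.max() - eps)


def abstaining_policy(policies, keep, X):
    """Stage 2: act where all retained policies agree, abstain where they disagree."""
    preds = np.stack([policies[j](X) for j in keep])  # shape (len(keep), n)
    agree = np.all(preds == preds[0], axis=0)
    return np.where(agree, preds[0], ABSTAIN)


# Synthetic logged data (assumption): treating helps exactly when x > 0.
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-1.0, 1.0, size=n)
A = rng.integers(0, 2, size=n)            # uniform logging policy
propensity = np.full(n, 0.5)
Y = (A == (X > 0).astype(int)).astype(float)

# Finite class of threshold policies: treat iff x > t.
thresholds = np.linspace(-0.5, 0.5, 11)
policies = [lambda x, t=t: (x > t).astype(int) for t in thresholds]

values = ipw_values(policies, X, A, Y, propensity)
keep = near_optimal_set(values, eps=0.05)
decisions = abstaining_policy(policies, keep, X)
```

In this sketch, points far from every retained threshold receive a definite action, while points lying between retained thresholds are marked `ABSTAIN`. A larger `eps` retains more policies and therefore widens the abstention region.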
Code Dataset Promise: No
Signed Copyright Form: pdf
Format Confirmation: I agree that I have read and followed the formatting instructions for the camera ready version.
Submission Number: 1968