Abstract: When applying contextual bandit algorithms in high-stakes settings (e.g., medical treatment), practitioners rely on off-policy evaluation (OPE) methods that use historical data to evaluate the behavior of novel policies prior to deployment. Unfortunately, OPE techniques are inherently limited by the breadth of the available data, which may not reflect distribution shifts resulting from the application of a new policy. Recent work attempts to address this challenge by leveraging domain experts to increase dataset coverage by annotating counterfactual samples. However, such annotations are not guaranteed to be free of errors, and incorporating imperfect annotations can lead to worse policy value estimates than not using the annotations at all. To make use of imperfect annotations, we propose a family of OPE estimators based on the doubly robust (DR) principle, which combines importance sampling (IS) with a reward model (direct method, DM) for better statistical guarantees. We introduce three opportunities within the DR estimation framework to incorporate counterfactual annotations. Under mild assumptions, we prove that using annotations within just the DM component yields the most desirable results, providing an unbiased estimator even under noisy annotations. We validate our approaches in several settings, including a real-world medical domain, observing that the theoretical advantages of using annotations within just the DM component hold in practice under realistic conditions. By addressing the challenges posed by imperfect annotations, this work broadens the applicability of OPE methods and facilitates safer and more effective deployment of decision-making systems.
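For readers unfamiliar with the DR principle the abstract refers to, below is a minimal sketch of the textbook doubly robust estimator for contextual bandits. This is not the paper's code: the function and argument names (`dr_estimate`, `reward_model`, etc.) are illustrative assumptions, and the comment about annotation-augmented reward modeling only gestures at where the paper's DM-side annotations would enter.

```python
import numpy as np

def dr_estimate(contexts, actions, rewards, logging_probs,
                target_probs, reward_model):
    """Standard doubly robust (DR) estimate of a target policy's value.

    contexts:      (n, d) array of logged contexts x_i
    actions:       (n,)   int array of logged actions a_i
    rewards:       (n,)   array of observed rewards r_i
    logging_probs: (n,)   behavior-policy propensities pi_0(a_i | x_i)
    target_probs:  (n, K) target-policy probabilities pi(a | x_i) over all K actions
    reward_model:  callable mapping contexts -> (n, K) predicted rewards q_hat(x, a);
                   in the paper's setting, this model would be fit on both logged
                   rewards and (possibly noisy) expert counterfactual annotations.
    """
    n, K = target_probs.shape
    q_hat = reward_model(contexts)                       # (n, K) DM predictions

    # Direct-method (DM) term: expected modeled reward under the target policy.
    dm_term = (target_probs * q_hat).sum(axis=1)

    # Importance-sampling (IS) correction applied to the observed action's residual.
    idx = np.arange(n)
    weights = target_probs[idx, actions] / logging_probs
    residual = rewards - q_hat[idx, actions]

    return float(np.mean(dm_term + weights * residual))
```

The DM term anchors the estimate with the reward model, while the IS-weighted residual corrects the model's errors on the logged actions; the estimator is unbiased if either component is correct, which is the "doubly robust" guarantee the abstract builds on.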
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Alberto_Bietti1
Submission Number: 5517