Batch-Adaptive Causal Annotations

Published: 03 Feb 2026, Last Modified: 03 Feb 2026 | AISTATS 2026 Poster | CC BY 4.0
TL;DR: We derive a closed-form optimal annotation probability that minimizes the variance of an average treatment effect (ATE) estimator with missing outcomes, and propose a batch-adaptive procedure under budget constraints that improves ATE estimation. We validate the approach on two real datasets.
Abstract: Estimating the causal effects of interventions is crucial to policy and decision-making, yet outcome data are often missing or subject to non-standard measurement error. While ground-truth outcomes can sometimes be obtained through costly data annotation or follow-up, budget constraints typically allow only a fraction of the dataset to be labeled. We address this challenge by optimizing \textit{which data points should be sampled for outcome information} in order to improve efficiency in average treatment effect (ATE) estimation with missing outcomes. We derive a closed-form solution for the optimal sampling probability in each batch by optimizing the asymptotic variance of a doubly robust estimator for causal inference with missing outcomes, and we show that the resulting procedure converges asymptotically to the optimal variance. Motivated by a collaboration with a street outreach provider generating millions of case notes, we also extend this framework to costly annotations of unstructured data, such as text or images, common in healthcare and social services. Across simulated and real-world datasets, including one on outreach interventions in homelessness services, our approach achieves substantially lower mean-squared error and recovers the augmented inverse probability weighting (AIPW) estimate with fewer labels than existing baselines. In practice, we show that our method can match confidence intervals obtained with 361 random samples using only 90 optimized samples, saving 75% of the labeling budget.
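To make the setting concrete, the following is a minimal sketch of a generic doubly robust (AIPW-style) ATE estimator when outcomes are observed only for annotated rows. It is not the paper's closed-form sampling rule, which the abstract does not state. The function names, the row fields, the simulated data-generating process, and the annotation rule `pi` are all hypothetical, and the propensity and annotation probabilities are assumed known.

```python
import random


def aipw_ate_missing_outcomes(rows):
    """AIPW-style ATE estimate when only annotated rows carry an outcome.

    Each row is a dict with (all names are illustrative assumptions):
      z   : treatment indicator (0 or 1)
      r   : 1 if the outcome was annotated, else 0
      y   : outcome, read only when r == 1
      e   : propensity P(Z=1 | X), assumed known
      pi  : annotation probability P(R=1 | Z, X), assumed known
      mu1 : outcome-model prediction of E[Y | Z=1, X]
      mu0 : outcome-model prediction of E[Y | Z=0, X]
    """
    total = 0.0
    for row in rows:
        # Model-based part: always available, even without a label.
        psi = row["mu1"] - row["mu0"]
        # Correction part: uses the residual, reweighted by the inverse
        # probability of being in this arm AND being annotated.
        if row["r"] == 1:
            if row["z"] == 1:
                psi += (row["y"] - row["mu1"]) / (row["e"] * row["pi"])
            else:
                psi -= (row["y"] - row["mu0"]) / ((1.0 - row["e"]) * row["pi"])
        total += psi
    return total / len(rows)


def simulate(n, seed=0):
    """Toy data with true ATE = 1.5 and deliberately imperfect outcome models."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        e = 0.3 + 0.4 * x            # hypothetical propensity score
        z = 1 if rng.random() < e else 0
        pi = 0.2 + 0.6 * x           # hypothetical annotation rule
        r = 1 if rng.random() < pi else 0
        y = 1.0 + 2.0 * x + 1.5 * z + rng.gauss(0.0, 1.0)
        rows.append({
            "z": z, "r": r, "e": e, "pi": pi,
            "y": y if r == 1 else None,   # unlabeled outcomes are never read
            "mu1": 2.0 + 1.5 * x,         # misspecified on purpose
            "mu0": 1.0 + 1.5 * x,         # misspecified on purpose
        })
    return rows


if __name__ == "__main__":
    print(aipw_ate_missing_outcomes(simulate(20000)))
```

Because the inverse weights `1 / (e * pi)` multiply the residuals, the choice of `pi` directly affects the estimator's variance; this is the quantity the abstract describes optimizing under a labeling budget. In this sketch the outcome models imply an effect of 1.0, and the weighted residual correction is what moves the estimate toward the simulated effect of 1.5.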
Submission Number: 2052