TL;DR: Data valuation for selecting which samples to label under a limited budget in treatment effect estimation
Abstract: Although numerous complex algorithms for treatment effect estimation have been developed in recent years, their effectiveness remains limited when training sets are insufficiently labeled due to the high cost of labeling the post-treatment effect, e.g., the expensive tumor imaging or biopsy procedures needed to evaluate treatment effects. It therefore becomes essential to actively acquire more high-quality labeled data while adhering to a constrained labeling budget. To enable data-efficient treatment effect estimation, we formalize the problem through rigorous theoretical analysis within the active learning context, where the derived key measures, the factual and counterfactual covering radii, determine the risk upper bound. To reduce the bound, we propose a greedy radius reduction algorithm, which excels under an idealized, balanced data distribution. To generalize to more realistic data distributions, we further propose FCCM, which transforms the optimization objective into Factual and Counterfactual Coverage Maximization to ensure effective radius reduction during data acquisition. Furthermore, benchmarking FCCM against other baselines demonstrates its superiority across both fully synthetic and semi-synthetic datasets. Code: https://github.com/uqhwen2/FCCM.
Lay Summary: Existing literature assumes that enough training data is available for treatment effect estimation, an assumption that can fail in the low-data regime. We are therefore motivated to extend the dataset under a constrained budget so that it can serve the many complex downstream algorithms for effect estimation.
To use the limited budget as efficiently as possible when extending the training dataset, we build a theoretical framework and identify the quantities most relevant to valuing each candidate data point. Then, given the vast unlabeled pool set, we rank the unlabeled data by their value according to our theory and acquire them in order from most valued to least valued, within the monetary budget.
Our work offers several potential societal benefits, including more accurate treatment effect estimation for new drugs and more robust causal inference models under data sparsity.
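To make the idea of budgeted acquisition by coverage concrete, the following is a minimal sketch of a generic greedy max-coverage selection over a pool with known treatment assignments. It is not the authors' FCCM implementation (see the linked repository for that). The function name `greedy_coverage_selection`, the fixed `radius` parameter, the Euclidean metric, and the rule that a labeled point of arm `a` covers every pool point within `radius` in arm `a` are all assumptions made purely for illustration.

```python
import numpy as np


def greedy_coverage_selection(X, t, budget, radius):
    """Illustrative greedy max-coverage selection (hypothetical helper).

    X: (n, d) covariates of the pool; t: (n,) binary treatment indicators.
    A selected point i with treatment t[i] is assumed to "cover" every pool
    point j within `radius` of it, in arm t[i]. If t[j] == t[i] this counts
    as factual coverage of j, otherwise as counterfactual coverage of j.
    """
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=int)
    n = len(X)

    # Pairwise Euclidean distances and the resulting ball-membership matrix.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within = dists <= radius  # within[i, j]: j lies in the ball around i

    # covered[j, a]: point j is within `radius` of a selected point of arm a.
    covered = np.zeros((n, 2), dtype=bool)
    selected = []

    for _ in range(min(budget, n)):
        # uncovered[i, j]: point j is not yet covered in candidate i's arm.
        uncovered = ~covered[:, t].T
        gains = (within & uncovered).sum(axis=1)
        gains[selected] = -1  # never pick the same point twice
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break  # nothing left to gain; stop before spending more budget
        selected.append(best)
        covered[within[best], t[best]] = True

    return selected, covered


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_pool = rng.normal(size=(200, 2))
    t_pool = rng.integers(0, 2, size=200)
    picked, cov = greedy_coverage_selection(X_pool, t_pool, budget=20, radius=0.5)
    print(len(picked), cov.mean())
```

The greedy rule here simply picks, at each step, the candidate that newly covers the most (point, arm) pairs, which is the standard heuristic for maximum coverage; how the paper defines and weights factual versus counterfactual coverage may differ.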
Primary Area: General Machine Learning->Causality
Keywords: Causal Inference; Treatment Effect Estimation; Active Learning
Submission Number: 2979