Keywords: treatment effect estimation, causal inference, LLM for structured data, data augmentation, covariate shift
TL;DR: We demonstrate that principled LLM-based data augmentation can improve CATE estimation in small-sample settings.
Abstract: We introduce $\texttt{GATE}$, a framework that improves conditional average treatment effect (CATE) estimation in small-sample regimes. Our framework augments datasets with synthetic _counterfactual_ outcomes using _pre-trained_ generative models. Doing so addresses the covariate shift problem that arises when inferring CATE from observational data. By using pre-trained generative models, $\texttt{GATE}$ equips downstream CATE models with knowledge _beyond the training data_. In particular, we instantiate $\texttt{GATE}$ with large language models (LLMs), which we show to work exceptionally well: LLMs utilise rich contextual information, such as dataset metadata, to generate outcomes grounded in real-world contexts. We demonstrate, both theoretically and empirically, that restricting augmentation to a carefully chosen subset of the covariate space can yield performance gains, _even with imperfect generated outcomes._
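The core idea in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: the generative model is replaced by a hypothetical noisy oracle `generate_counterfactual` (standing in for an LLM prompted with dataset metadata), and the "carefully chosen subset" is illustrated by a simple predicate over the covariate. It shows the mechanics only: for units in the selected region, append a synthetic outcome under the _opposite_ treatment, so a downstream CATE learner sees both arms there.

```python
import random

# Hypothetical stand-in for a pre-trained generative model (e.g., an LLM
# prompted with dataset metadata). Here it is a simple noisy oracle: the
# "true" CATE 2*x and the noise level are assumptions for this demo only.
def generate_counterfactual(x, treatment):
    true_effect = 2.0 * x
    base = x + (true_effect if treatment == 1 else 0.0)
    return base + random.gauss(0, 0.5)  # imperfect generated outcome

def augment(data, region):
    """Append synthetic counterfactual outcomes, but only for units whose
    covariate satisfies `region` -- restricted augmentation, as in the
    abstract's claim about a carefully chosen subset of covariate space."""
    augmented = list(data)
    for x, t, _y in data:
        if region(x):
            t_cf = 1 - t  # the treatment arm we did NOT observe
            augmented.append((x, t_cf, generate_counterfactual(x, t_cf)))
    return augmented

random.seed(0)
# Small observational sample with covariate shift: treated units
# concentrate at large x, so overlap is poor for x <= 0.5.
xs = [random.random() for _ in range(40)]
data = [(x, int(x > 0.5 or random.random() < 0.1), 0.0) for x in xs]
data = [(x, t, generate_counterfactual(x, t)) for x, t, _ in data]

# Augment only the poorly-overlapping region.
aug = augment(data, region=lambda x: x <= 0.5)
print(f"original rows: {len(data)}, augmented rows: {len(aug)}")
```

Every synthetic row lies inside the chosen region and carries the flipped treatment indicator; units outside the region are left untouched, so a bad generator cannot corrupt the well-overlapping part of the sample.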
Submission Number: 82