Keywords: treatment effect estimation, causal inference, LLM for structured data, data augmentation, covariate shift
TL;DR: We demonstrate that principled LLM-based data augmentation can improve CATE estimation in small-sample settings.
Abstract: We introduce $\texttt{GATE}$, a framework that improves conditional average treatment effect (CATE) estimation in small-sample regimes. Our framework augments datasets with synthetic _counterfactual_ outcomes using _pre-trained_ generative models. Doing so addresses the covariate shift problem that arises when inferring CATE from observational data. By using pre-trained generative models, $\texttt{GATE}$ equips downstream CATE models with knowledge _beyond the training data_. In particular, we instantiate $\texttt{GATE}$ with large language models (LLMs), which we show to work exceptionally well: LLMs utilise rich contextual information, such as dataset metadata, to generate outcomes grounded in real-world contexts. We demonstrate, both theoretically and empirically, that restricting augmentation to a carefully chosen subset of the covariate space can yield performance gains, _even with imperfect generated outcomes._
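The core idea in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: the generative model is replaced by a hypothetical noisy oracle `generate_counterfactual` (standing in for an LLM prompted with dataset metadata), and the "carefully chosen subset" is illustrated by a simple predicate over the covariate. It shows the mechanics only: for units in the selected region, append a synthetic outcome under the _opposite_ treatment, so a downstream CATE learner sees both arms there.

```python
import random

# Hypothetical stand-in for a pre-trained generative model (e.g., an LLM
# prompted with dataset metadata). Here it is a simple noisy oracle: the
# "true" CATE 2*x and the noise level are assumptions for this demo only.
def generate_counterfactual(x, treatment):
    true_effect = 2.0 * x
    base = x + (true_effect if treatment == 1 else 0.0)
    return base + random.gauss(0, 0.5)  # imperfect generated outcome

def augment(data, region):
    """Append synthetic counterfactual outcomes, but only for units whose
    covariate satisfies `region` -- restricted augmentation, as in the
    abstract's claim about a carefully chosen subset of covariate space."""
    augmented = list(data)
    for x, t, _y in data:
        if region(x):
            t_cf = 1 - t  # the treatment arm we did NOT observe
            augmented.append((x, t_cf, generate_counterfactual(x, t_cf)))
    return augmented

random.seed(0)
# Small observational sample with covariate shift: treated units
# concentrate at large x, so overlap is poor for x <= 0.5.
xs = [random.random() for _ in range(40)]
data = [(x, int(x > 0.5 or random.random() < 0.1), 0.0) for x in xs]
data = [(x, t, generate_counterfactual(x, t)) for x, t, _ in data]

# Augment only the poorly-overlapping region.
aug = augment(data, region=lambda x: x <= 0.5)
print(f"original rows: {len(data)}, augmented rows: {len(aug)}")
```

Every synthetic row lies inside the chosen region and carries the flipped treatment indicator; units outside the region are left untouched, so a bad generator cannot corrupt the well-overlapping part of the sample.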
Submission Number: 82