Generalizing Causal Effects from Randomized Controlled Trials to Target Populations across Diverse Environments

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: Generalizing causal effects from Randomized Controlled Trials (RCTs) to target populations across diverse environments is of significant practical importance, as RCTs are often costly and logistically complex to conduct. A key challenge is environmental shift, defined as changes in the distribution and availability of covariates between source and target environments. A common approach to this challenge is to identify a separating set, that is, covariates that govern both treatment effect heterogeneity and environmental differences, and to combine RCT samples with target populations matched on this set. However, this approach assumes that the separating set is fully observed and shared across datasets, an assumption often violated in practice. We propose a novel Two-Stage Doubly Robust (2SDR) method that relaxes this assumption by allowing the separating set to be observed in only one of the two datasets. 2SDR leverages shadow variables to impute missing components of the separating set and generalizes treatment effects across environments in a two-stage procedure. We establish the identification of causal effects in target environments under 2SDR and demonstrate its effectiveness through extensive experiments on both synthetic and real-world datasets.
Lay Summary: Generalizing causal effects from Randomized Controlled Trials (RCTs) across diverse environments is challenging due to environmental shifts. A common solution is to combine and match RCT data with observational data from the target population using a separating set. This solution relies on the assumption that the covariates shared by both groups contain the separating set, which is difficult to satisfy in real-world scenarios under environmental shifts. Our solution, Two-Stage Doubly Robust (2SDR), relaxes this assumption. It only assumes that the variables in the separating set are present in at least one of the two data groups: either the RCT data or the observational data from the target population. 2SDR leverages automatically selected shadow variables to impute the missing covariates and then generalizes treatment effects from RCTs across environments in two stages. Both identifiability theory and extensive experimental evidence on synthetic and real-world datasets support the correctness and effectiveness of our solution.
Primary Area: General Machine Learning->Causality
Keywords: Causal Inference, Randomized Controlled Trial, Generalization, Treatment Effect Estimation, Data Fusion, Missing Covariates, Selection Bias
Submission Number: 5798