Generating High-Fidelity Privacy-Conscious Synthetic Patient Data for Causal Effect Estimation with Multiple Treatments

Published: 28 Jan 2022; Last Modified: 13 Feb 2023; ICLR 2022 Submitted; Readers: Everyone
Keywords: synthetic data, causal inference, EHR, healthcare, deep generative modeling, treatment effects, model validation, observational patient data, patient privacy
Abstract: A causal effect can be defined as the comparison of outcomes from two or more alternative treatments. Knowing this treatment effect is critically important in healthcare because it makes it possible to identify the best treatment for a person when more than one option exists. In the past decade, interest has grown rapidly in using observational data collected as part of routine healthcare practice to determine the effect of a treatment with causal inference models. Validation of these models, however, has been a challenge because the ground truth is unknown: only one treatment-outcome pair can be observed for each person. There have been multiple efforts to fill this void using synthetic data where the ground truth can be generated. However, to date, these datasets have been severely limited in their utility by being modeled after small, non-representative patient populations, by being dissimilar to real target populations, or by providing known effects for only two cohorts (treated vs. control). In this work, we produced a large-scale, realistic synthetic dataset that supports multiple hypertension treatments, modeled after the multi-year history of diagnoses, medications, and laboratory values of a nationwide cohort of more than 250,000 hypertension patients. We designed a data generation process that combines an adapted ADS-GAN model for fictitious patient information generation with a neural network for treatment outcome generation. A Wasserstein distance of 0.35 indicates that our synthetic data follows a joint distribution nearly identical to that of the patient cohort used to generate the data. Our dataset provides ground truth effects for about 30 hypertension treatments on blood pressure outcomes.
Patient privacy was a primary concern for this study; the $\epsilon$-identifiability metric, which estimates the probability of actual patients being identified, is 0.008%, ensuring that our synthetic data cannot be used to identify any actual patients. Using our dataset, we tested the bias in causal effect estimation of three well-established models: propensity score stratification, a doubly robust (DR) approach with logistic regression, and a DR approach with random forest (RF) classification. Interestingly, we found that while the RF DR outperformed the logistic DR as expected, the best performance actually came from propensity score stratification, despite the theoretically stronger statistical properties of the DR family of models. We believe this dataset will facilitate the further development, evaluation, and comparison of real-world causal models. The approach we used can be readily extended to other diseases in the clinical domain, and to datasets in other domains as well.
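To make the evaluation idea concrete, the following is a minimal sketch of how the bias of an estimator can be measured when the ground-truth effect is known. It is not the authors' pipeline: the simulated data, the two-arm (treated vs. control) simplification, the linear outcome model, and all function names (`simulate`, `fit_logistic`, `stratified_ate`, `aipw_ate`) are assumptions made for illustration, and the toy setup is not expected to reproduce the ranking of methods reported in the abstract.

```python
import numpy as np

def simulate(n=20000, seed=0):
    """Toy confounded data with a known effect (hypothetical, not the paper's data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))                       # two confounders
    e_true = 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
    t = rng.binomial(1, e_true)                       # treatment depends on X
    true_ate = -5.0                                   # assumed ground-truth effect
    y = 140 + 4 * X[:, 0] + 2 * X[:, 1] + true_ate * t + rng.normal(scale=3, size=n)
    return X, t, y, true_ate

def _with_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic(X, t, n_iter=25):
    """Logistic regression fitted with Newton's method; returns weights."""
    Xb = _with_intercept(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        grad = Xb.T @ (t - p)
        hess = (Xb * (p * (1 - p))[:, None]).T @ Xb
        w += np.linalg.solve(hess, grad)
    return w

def propensity(X, w):
    """Estimated probability of treatment given covariates."""
    return 1.0 / (1.0 + np.exp(-_with_intercept(X) @ w))

def stratified_ate(y, t, e, n_strata=5):
    """Propensity score stratification: size-weighted mean of within-stratum differences."""
    edges = np.quantile(e, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, e, side="right") - 1, 0, n_strata - 1)
    total, n_used = 0.0, 0
    for s in range(n_strata):
        m = strata == s
        treated, control = m & (t == 1), m & (t == 0)
        if treated.sum() == 0 or control.sum() == 0:
            continue                                  # skip strata lacking one arm
        total += (y[treated].mean() - y[control].mean()) * m.sum()
        n_used += m.sum()
    return total / n_used

def aipw_ate(y, t, X, e):
    """Doubly robust (AIPW) estimate with per-arm linear outcome models."""
    Xb = _with_intercept(X)
    e = np.clip(e, 1e-3, 1 - 1e-3)                    # guard against extreme weights
    b1 = np.linalg.lstsq(Xb[t == 1], y[t == 1], rcond=None)[0]
    b0 = np.linalg.lstsq(Xb[t == 0], y[t == 0], rcond=None)[0]
    m1, m0 = Xb @ b1, Xb @ b0
    return np.mean(m1 - m0 + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e))

if __name__ == "__main__":
    X, t, y, true_ate = simulate()
    e_hat = propensity(X, fit_logistic(X, t))
    estimates = {
        "naive difference in means": y[t == 1].mean() - y[t == 0].mean(),
        "propensity stratification": stratified_ate(y, t, e_hat),
        "doubly robust (AIPW)": aipw_ate(y, t, X, e_hat),
    }
    for name, est in estimates.items():
        print(f"{name}: estimate={est:.3f}, bias={est - true_ate:.3f}")
```

Because the effect is set by construction in `simulate`, each estimator's bias is simply its estimate minus `true_ate`, which is the kind of comparison a synthetic dataset with ground truth makes possible.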
One-sentence Summary: In this work, we produce a large-scale and realistic synthetic patient dataset with ground truth for treatment effects to validate causal inference models.
Supplementary Material: zip