An interpretable data augmentation framework for improving generative modeling of synthetic clinical trial data

Published: 20 Jun 2023, Last Modified: 19 Jul 2023, IMLH 2023 Oral
Keywords: synthetic data, clinical trials, privacy, data augmentation, machine learning
TL;DR: High-quality data augmentation for clinical trial training datasets can improve the utility and fidelity of synthetic datasets generated using generative models while maintaining low privacy risk.
Abstract: Synthetic clinical trial data are increasingly seen as a viable option for research applications when primary data are unavailable. A challenge in applying generative modeling approaches for this purpose is that many clinical trial datasets have small sample sizes. In this paper, we present an interpretable data augmentation framework for improving generative models used to produce synthetic clinical trial data. We apply this framework to three clinical trial datasets spanning different disease indications and evaluate the impact of factors such as initial dataset size, generative algorithm, and augmentation scale on metrics used to assess synthetic clinical trial data quality, including fidelity, utility, and privacy. The results indicate that this framework can considerably improve the quality of synthetic data produced using generative algorithms when considering factors of high interest to end users of synthetic clinical trial data.
Submission Number: 86