Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models

Published: 30 Oct 2023, Last Modified: 30 Nov 2023, Venue: SyntheticData4ML 2023 Poster
Keywords: synthetic data; generative adversarial networks; diffusion models; electronic health records; mixed-type dataset; time-series dataset
TL;DR: We present a new method that uses Diffusion Probabilistic Models to generate realistic synthetic Electronic Health Records. The generated datasets are more realistic than those produced by existing GAN-based methods and are effective for training other machine learning algorithms in clinical scenarios.
Abstract: This paper introduces a novel method for simulating Electronic Health Records (EHRs) using Diffusion Probabilistic Models (DPMs). We showcase the ability of DPMs to generate longitudinal EHRs with mixed-type variables – numeric, binary, and categorical. Our approach is benchmarked against existing Generative Adversarial Network (GAN)-based methods in two clinical scenarios: management of acute hypotension in the intensive care unit and antiretroviral therapy for people with human immunodeficiency virus. Our DPM-simulated datasets not only minimise patient disclosure risk but also outperform GAN-generated datasets in terms of realism. These datasets also prove effective for training downstream machine learning algorithms, including reinforcement learning and Cox proportional hazards models for survival analysis.
Submission Number: 7
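
As a rough illustration of the kind of model the abstract refers to, the sketch below applies the standard Gaussian forward (noising) process of a denoising diffusion model to a short longitudinal record with one numeric, one binary, and one categorical variable. It is not the paper's implementation: the one-hot encoding of the discrete variables, the linear noise schedule, and every function and field name here are assumptions made for illustration, since the abstract does not say how discrete variables are handled or which schedule is used.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def one_hot(index, num_classes):
    """Encode a category index as a one-hot vector of floats."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def encode_visit(visit, num_categories):
    """Hypothetical encoding: numeric as-is, binary as 0/1, categorical one-hot."""
    return (
        [visit["numeric"], float(visit["binary"])]
        + one_hot(visit["category"], num_categories)
    )


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    x0 is a list of time steps, each a list of features; t is 1-indexed.
    """
    a = alpha_bars[t - 1]
    scale, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    return [[scale * v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in x0]


# Toy two-visit record (made-up values, not data from the paper).
rng = random.Random(0)
record = [
    {"numeric": 0.3, "binary": 1, "category": 2},
    {"numeric": -0.1, "binary": 0, "category": 0},
]
x0 = [encode_visit(v, num_categories=3) for v in record]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x_mid = forward_diffuse(x0, 500, alpha_bars, rng)   # partially noised record
x_end = forward_diffuse(x0, 1000, alpha_bars, rng)  # close to pure Gaussian noise
```

A full generative model would additionally train a neural network to reverse this noising step by step and then decode the continuous outputs back into numeric, binary, and categorical values; that part is omitted here.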