EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models

Published: 07 Apr 2024, Last Modified: 07 Apr 2024. Accepted by TMLR.
Abstract: Electronic health records (EHR) contain a wealth of biomedical information, serving as valuable resources for the development of precision medicine systems. However, privacy concerns have limited researchers' access to high-quality, large-scale EHR data, impeding progress in methodological development. Recent research has explored synthesizing realistic EHR data through generative modeling techniques, with most proposed methods relying on generative adversarial networks (GAN) and their variants. Although GAN-based methods attain state-of-the-art performance in generating EHR data, they are difficult to train and prone to mode collapse. Diffusion models, a more recent class of generative models, have established cutting-edge performance in image generation, but their efficacy in EHR data synthesis remains largely unexplored. In this study, we investigate the potential of diffusion models for EHR data synthesis and introduce a novel method, EHRDiff. Through extensive experiments, EHRDiff establishes new state-of-the-art quality for synthetic EHR data while protecting private information.
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url:
Changes Since Last Submission: Major revision following the comments of the AE and reviewers on the previous submission. This resubmission follows the AE's advice on the previous submission. The revision includes the following:
- Revise the contribution of the paper, and reposition the paper as 'one of the first works' on diffusion models for EHR synthesis.
- Revise the presentation of the methodology, including moving the introduction of general diffusion models to the background.
- Add standard errors for the numerical results in the paper.
- Move the results on ECG and Cinc2012 data to the main results and revise the discussion as requested.
- Add ablation studies of model designs and of how performance scales with synthetic sample size, as advised by reviewers of the previous submission. These additional results form a new discussion section, as requested by the AE and reviewers.
- Correct grammatical errors and references, and rephrase ambiguous expressions.
Update according to the comments from Reviewer F7A1 on 12/19/2023:
- Revise the wording and punctuation of the paper.
- Add a new section discussing the broader impacts of the EHR synthesis model (Section 6).
Update according to the final decision on 03/17/2024:
- Update the camera-ready version of the manuscript.
- We thank all the reviewers and editors for their comments and efforts.
Supplementary Material: zip
Assigned Action Editor: ~Krzysztof_Jerzy_Geras1
Submission Number: 1574