SynEHRgy: Synthesizing Mixed-Type Structured Electronic Health Records using Decoder-Only Transformers

Published: 12 Oct 2024, Last Modified: 17 Dec 2024
Venue: GenAI4Health Poster
License: CC BY 4.0
Keywords: EHR, synthetic data generation, Transformers, irregularly-sampled time series
TL;DR: We generate high-quality synthetic EHRs with a GPT-like model and a novel tokenization strategy, aiming to improve data privacy and machine learning.
Abstract: Generating synthetic Electronic Health Records (EHRs) offers significant potential for data augmentation, privacy-preserving data sharing, and enhancing machine learning model training. We propose a novel tokenization strategy tailored for structured EHR data, which encompasses diverse data types such as covariates, ICD codes, and irregularly sampled time series. Utilizing a GPT-like decoder-only transformer model, we demonstrate the generation of high-quality synthetic EHRs. Our approach is evaluated using the MIMIC-III dataset, and we benchmark the fidelity, utility, and privacy of the generated data against state-of-the-art models.
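The abstract does not spell out the tokenization itself, so the following is only a minimal illustrative sketch of how a mixed-type record (covariates, ICD codes, irregularly sampled measurements) might be flattened into one token sequence for a decoder-only model. Everything in it is an assumption made for illustration rather than the paper's scheme: the token names (`SEX_`, `AGE_`, `ICD_`, `DT_`, `VAR_`, `VAL_`), the bin edges, the record layout, and the function `tokenize_record`.

```python
from bisect import bisect_right

# Hypothetical bin edges; in practice these would be fitted on training data.
AGE_EDGES = [30, 50, 70]              # years -> 4 age bins
GAP_EDGES_H = [1, 6, 24]              # hours since previous event -> 4 bins
VALUE_EDGES = {                       # per-variable value bins
    "HR": [60, 80, 100, 120],
    "SBP": [90, 120, 140],
}


def bin_index(value, edges):
    """Return the index of the bin that `value` falls into."""
    return bisect_right(edges, value)


def tokenize_record(record):
    """Flatten one patient record into a list of string tokens.

    Assumed layout: covariates first, then diagnosis codes, then
    time-ordered measurements. Each measurement becomes three tokens:
    binned time gap, variable name, binned value.
    """
    tokens = ["<BOS>"]

    # Static covariates: categorical as-is, numeric binned.
    tokens.append(f"SEX_{record['sex']}")
    tokens.append(f"AGE_{bin_index(record['age'], AGE_EDGES)}")

    # Diagnosis codes, one token each.
    for code in record["icd_codes"]:
        tokens.append(f"ICD_{code}")

    # Irregular time series: encode the gap to the previous event
    # instead of an absolute timestamp.
    prev_t = 0.0
    for t, var, value in sorted(record["events"]):
        tokens.append(f"DT_{bin_index(t - prev_t, GAP_EDGES_H)}")
        tokens.append(f"VAR_{var}")
        tokens.append(f"VAL_{bin_index(value, VALUE_EDGES[var])}")
        prev_t = t

    tokens.append("<EOS>")
    return tokens


# Toy record; all values are made up for the example.
record = {
    "sex": "F",
    "age": 64,
    "icd_codes": ["4280", "25000"],
    "events": [(0.5, "HR", 88.0), (0.5, "SBP", 118.0), (7.0, "HR", 101.0)],
}

tokens = tokenize_record(record)
print(tokens)
```

Encoding the gap between events rather than the absolute time is one common way to handle irregular sampling in a sequence model; whether SynEHRgy does this is not stated in the abstract.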
Submission Number: 14