Synthetic Electronic Health Record Generation of Rare Disease With Reinforcement Learning

Published: 07 Mar 2025, Last Modified: 25 Mar 2025 · GenAI4Health Poster · CC BY 4.0
Keywords: language model; foundation model; reinforcement learning
Abstract: The generation of synthetic electronic health record (EHR) data using large foundation models (FMs) holds immense potential for mitigating data scarcity in healthcare, particularly in addressing the critical challenge of modeling rare diseases. However, the inherent imbalance in EHR data, where rare diseases are underrepresented, limits the ability of FMs to accurately generate these crucial data samples. This quality gap affects the usability of synthetic data in downstream applications, such as predictive modeling for rare diseases. To tackle this challenge, we propose Reinforcement Learning with Target Feedback (RLTF), a reinforcement learning-based framework designed to fine-tune FMs specifically for generating high-quality synthetic EHR data. By leveraging Direct Preference Optimization (DPO), RLTF optimizes the generative model to favor sequences that closely replicate real-world patterns of rare disease groups, ensuring their accurate representation. Experimental results demonstrate that RLTF significantly outperforms the base model and other state-of-the-art methods in generating rare diagnostic codes and improves the utility of synthetic data for downstream tasks, such as rare disease prediction.
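The abstract does not spell out the DPO objective used by RLTF, but the standard DPO loss it leverages can be sketched as follows. This is a generic, illustrative implementation, not the authors' code: the function names and the idea of treating a realistic rare-disease code sequence as the "chosen" sample and a lower-quality generation as the "rejected" one are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    logp_* are the fine-tuned model's total sequence log-probabilities
    for the preferred (chosen) and dispreferred (rejected) sequences;
    ref_logp_* are the same quantities under the frozen reference model.
    In the RLTF setting, a "chosen" sequence would be one that matches
    real rare-disease coding patterns (hypothetical pairing).
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen sequence over the rejected one, relative to the reference.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid drives the margin to be large and positive.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy and reference models agree exactly (margin 0), the loss is log 2 ≈ 0.693; it shrinks as the policy assigns relatively more probability to the preferred sequence, which is the mechanism by which DPO steers generation toward rare-disease patterns.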
Submission Number: 17
