Breaking the Barrier of Hard Samples: A Data-Centric Approach to Synthetic Data for Medical Tasks

Published: 01 May 2025, Last Modified: 18 Jun 2025, Venue: ICML 2025 poster, License: CC BY 4.0
Abstract: Data scarcity and quality issues remain significant barriers to developing robust predictive models in medical research. Traditional reliance on real-world data often leads to biased models with poor generalizability across diverse patient populations. Synthetic data generation has emerged as a promising solution, yet challenges related to the representativeness and effective utilization of synthetic samples persist. This paper introduces Profile2Gen, a novel data-centric framework designed to guide the generation and refinement of synthetic data, with a focus on hard-to-learn samples in regression tasks. We conducted approximately 18,000 experiments to validate its effectiveness across six medical datasets, using seven state-of-the-art generative models. Results demonstrate that refined synthetic samples can reduce predictive errors and enhance model reliability. Additionally, we generalize the DataIQ framework to support regression tasks, enabling its application in broader contexts. Statistical analyses confirm that our approach achieves equal or superior performance compared to models trained exclusively on real data.
Lay Summary: Healthcare researchers often face a fundamental problem: there is rarely enough patient data to train AI models properly. Privacy laws and data collection challenges restrict the quantity of available data. It is like teaching someone to drive with only five driving lessons. Our work helps address this by creating synthetic patient data that mimics real data. Previous research has shown that some examples are harder for AI models to learn than others. We developed Profile2Gen to generate synthetic data while identifying and treating those "difficult cases". Instead of only copying sample patterns, our approach refines the data through multiple steps to improve its quality. We tested this across approximately 18,000 experiments using real medical datasets. We found that our generated data improves model accuracy and, in some cases, performs better than using only real data. Our approach enables researchers and clinicians to train more reliable AI models, even in data-scarce situations, supporting better diagnostics and personalized treatments.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/szanara/profile2gen.git
Primary Area: Applications->Health / Medicine
Keywords: Medical tasks, synthetic data, data-centric
Submission Number: 13807