Keywords: Synthetic Clinical Data, Electronic Health Records, Causal Discovery, Trustworthy AI, Structural Validity, Neuro-Symbolic AI, Tabular Foundation Models, Low-Resource Machine Learning, Healthcare AI, Knowledge-Guided Generation
TL;DR: DataSynK is a causal-symbolic pipeline that integrates medical ontologies, tiered causal discovery, and ASP logic to synthesize biologically valid and highly predictive tabular EHR data.
Abstract: The scarcity of labeled electronic health records (EHRs) limits clinical foundation models in low-resource settings. Existing synthetic data generators rely on statistical fidelity metrics that fail to capture clinical validity, and they often produce biologically implausible patient populations. We propose DataSynK, a causal-symbolic framework that integrates causal discovery, medical ontologies, and logical constraints to generate structurally valid synthetic EHRs. Experiments on Brazilian clinical data reveal a strong dissociation between statistical fidelity and clinical validity, and show that DataSynK achieves superior ontological validity and downstream classification utility. Our results suggest that structural validity should become a core evaluation criterion for trustworthy synthetic clinical data generation.
Submission Category: Full Paper
Overaged Verification: Yes
Latin American Hispanic Heritage: Yes
ICML Proceedings Status: No
Submission Number: 28