DataSynK: Causal-Symbolic EHR Synthesis for Tabular Foundation Models in Low-Resource Settings

Published: 25 May 2026, Last Modified: 29 May 2026, FMSD @ ICML 2026 Poster, CC BY 4.0
Keywords: Synthetic Tabular Data, Tabular Foundation Models, Causal Discovery, Electronic Health Records, Low-Resource Settings, Neuro-symbolic AI
TL;DR: DataSynK is a causal-symbolic pipeline that integrates medical ontologies, tiered causal discovery, and ASP logic to synthesize biologically valid and highly predictive tabular EHR data.
Abstract: The chronic scarcity of labeled electronic health records (EHRs) limits the development of tabular foundation models, especially in Global South settings. Deep generative models collapse under extreme data scarcity, while traditional structured generators fail to guarantee clinical plausibility. To address both limitations, we propose DataSynK, a pipeline that integrates causal discovery, prior knowledge from medical ontologies, and symbolic logic constraints to synthesize binary tabular EHRs. Empirical evaluations on real-world clinical data show that DataSynK prevents mode collapse in low-resource regimes and uniquely achieves full ontological validity. It also significantly improves downstream predictive utility for imbalanced classes compared to purely statistical baselines, establishing a robust framework for knowledge-guided synthetic data generation.
Submission Number: 188