Constraint-Aware Tabular Variational AutoEncoder for Synthetic Data Generation in Health Domain

Published: 22 Sept 2025, Last Modified: 22 Sept 2025. Venue: WiML @ NeurIPS 2025. Readers: Everyone. License: CC BY 4.0
Keywords: Synthetic Data Generation, Healthcare AI, Variational AutoEncoders, Constraint Satisfaction, African Healthcare
Abstract: Healthcare AI development in resource-constrained regions, particularly in Africa, faces major challenges, including limited access to quality datasets, privacy regulations, and inadequate data-sharing infrastructure. Together these constitute a critical gap for data-driven healthcare innovation. Synthetic data generation is a promising solution, yet existing approaches such as the Tabular Variational AutoEncoder (TVAE) focus primarily on statistical similarity between real and synthetic data and do not enforce domain-specific logical consistency, so the generated data can be statistically similar to the real data yet medically implausible. We propose Constraint-Aware TVAE (CA-TVAE), an extension of the traditional TVAE that incorporates domain-specific constraints through post-processing validation and correction. Our approach defines constraints as a set of logical functions C = {c_1, c_2, ..., c_n}, where c_i(x) = 1 indicates that record x satisfies constraint c_i. We implemented eight relational constraints specific to the cardiovascular domain, including blood pressure grade constraints, diabetes-related constraints, lifestyle consistency, and socioeconomic logic. The methodology has four steps: training a standard TVAE model, generating synthetic data, validating the synthetic data against the defined constraints, and applying corrective post-processing to ensure constraint satisfaction. We evaluated our approach on cardiovascular survey data from northern Senegal (n=911, 168 features after preprocessing). Statistical fidelity was assessed using Kolmogorov-Smirnov tests for numerical features and χ² tests for categorical features. Machine learning utility was evaluated using the Train-Synthetic-Test-Real (TSTR) framework with Random Forest classifiers. Results: (1) Statistical fidelity: 26/43 numerical features and 100/125 categorical features passed the distribution similarity tests. (2) ML utility: CA-TVAE achieved 97.1% AUC, compared to 97.4% on the original data and 94.2% with standard TVAE. (3) Constraint satisfaction: 100% compliance with all defined domain constraints after post-processing. This work addresses critical healthcare data challenges in under-represented regions by providing a practical framework for generating medically consistent synthetic data. The constraint-aware approach ensures that synthetic cardiovascular data preserves clinical logic as well as statistical properties, making it better suited to training robust AI models. This is particularly valuable for African healthcare systems, where data scarcity significantly limits AI development. Our approach demonstrates that domain expertise can be integrated effectively into synthetic data generation, paving the way for more reliable AI applications in resource-constrained healthcare environments. The methodology is generalizable to other medical domains and geographical contexts that face similar data challenges. Future work will focus on automating constraint discovery and on integrating constraints within the TVAE architecture itself rather than through post-processing alone.
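The validation-and-correction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`systolic`, `bp_grade`, `on_diabetes_treatment`, and so on), the blood pressure thresholds, and the three example constraints are assumptions chosen for illustration, and the paper's eight constraints are not reproduced here. TVAE training and sampling are omitted; the sketch starts from already generated records. Each constraint pairs a check c_i(x) with a repair applied only when the check fails.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
# A constraint is (name, check c_i(x) -> bool, repair returning a corrected copy).
Constraint = Tuple[str, Callable[[Record], bool], Callable[[Record], Record]]


def bp_grade(systolic: float, diastolic: float) -> int:
    """Illustrative grading (0 = below hypertension range); thresholds are assumptions."""
    if systolic >= 180 or diastolic >= 110:
        return 3
    if systolic >= 160 or diastolic >= 100:
        return 2
    if systolic >= 140 or diastolic >= 90:
        return 1
    return 0


# Hypothetical constraints; each repair touches a different field,
# so one repair cannot break another constraint in this sketch.
CONSTRAINTS: List[Constraint] = [
    (
        "bp_grade_matches_measurements",
        lambda x: x["bp_grade"] == bp_grade(x["systolic"], x["diastolic"]),
        lambda x: {**x, "bp_grade": bp_grade(x["systolic"], x["diastolic"])},
    ),
    (
        "diabetes_treatment_implies_diagnosis",
        lambda x: (not x["on_diabetes_treatment"]) or x["diabetes_diagnosed"],
        lambda x: {**x, "diabetes_diagnosed": True},
    ),
    (
        "non_smoker_has_zero_cigarettes",
        lambda x: x["smoker"] or x["cigarettes_per_day"] == 0,
        lambda x: {**x, "cigarettes_per_day": 0},
    ),
]


def violations(x: Record) -> List[str]:
    """Names of all constraints c_i with c_i(x) = 0."""
    return [name for name, check, _ in CONSTRAINTS if not check(x)]


def correct(x: Record) -> Record:
    """Apply the repair of every violated constraint, in order."""
    for _, check, repair in CONSTRAINTS:
        if not check(x):
            x = repair(x)
    return x


def satisfaction_rate(records: List[Record]) -> float:
    """Fraction of records that satisfy every constraint."""
    if not records:
        return 1.0
    return sum(1 for r in records if not violations(r)) / len(records)


# Two hand-written records standing in for TVAE samples (made-up values).
synthetic = [
    {"systolic": 165, "diastolic": 95, "bp_grade": 1,
     "on_diabetes_treatment": True, "diabetes_diagnosed": False,
     "smoker": False, "cigarettes_per_day": 5},
    {"systolic": 120, "diastolic": 80, "bp_grade": 0,
     "on_diabetes_treatment": False, "diabetes_diagnosed": False,
     "smoker": True, "cigarettes_per_day": 10},
]
corrected = [correct(r) for r in synthetic]
print(satisfaction_rate(synthetic), satisfaction_rate(corrected))
```

In this toy input the first record violates all three constraints and the second violates none, so the satisfaction rate moves from 0.5 before correction to 1.0 after. A post-hoc repair like this guarantees compliance by construction, which is why distributional fidelity and downstream utility still need to be checked separately on the corrected data.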
Submission Number: 166