Enhancing the diagnosis of CVD and depression comorbidity through the augmentation of synthetic metabolomics data

Published: 19 Aug 2025, Last Modified: 12 Oct 2025, Venue: BHI 2025, License: CC BY 4.0
Keywords: CVD, depression, metabolomics, synthetic data augmentation, AI
TL;DR: A synthetic data augmentation pipeline to improve the diagnosis of cardiovascular disease and its comorbidity with depression, demonstrating superior AI model performance on imbalanced multimodal metabolomics data from the UK Biobank
Abstract: Comorbid cardiovascular disease (CVD) and depression remain diagnostically challenging due to complex phenotypes and severe class imbalance in real-world health data. Conventional resampling methods often fail to preserve the multimodal structure of high-dimensional clinical and metabolomics features. This study presents an AI-based pipeline, applied to clinical and NMR-based metabolomics data from the UK Biobank, that compares random downsampling against synthetic data augmentation using generative models such as the conditional tabular generative adversarial network (CTGAN), the tabular variational autoencoder (TVAE), and the tabular denoising diffusion probabilistic model (TabDDPM). The synthetic data produced by the CTGAN achieved the highest fidelity (Jensen-Shannon divergence of 0.06 and average correlation difference of 0.11 for the CVD diagnosis outcome). The AI models trained on synthetic data achieved superior performance across both classification tasks. For CVD diagnosis, XGBoost reached 0.91 accuracy and 0.96 AUC; for comorbid CVD and depression, it reached 0.87 accuracy and 0.92 AUC. These results support synthetic augmentation as a robust solution to improve diagnostic performance across imbalanced datasets in healthcare.
Track: 4. Clinical Informatics
Registration Id: HGN2KLJCGNV
Submission Number: 59
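
The abstract names two fidelity metrics, the Jensen-Shannon divergence and the average correlation difference, without giving formulas. The sketch below shows one common way such metrics can be computed for tabular data. Everything in it is an assumption rather than the authors' procedure: the histogram binning (`bins=20`), the base-2 logarithm, the per-feature averaging, the use of Pearson correlation, and the function names are all illustrative, and the toy arrays stand in for real and synthetic feature matrices.

```python
import numpy as np


def js_divergence(real, synth, bins=20):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two 1-D samples.

    Assumption: both samples are discretised on a shared equal-width histogram.
    """
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    if hi == lo:  # both samples are the same constant
        return 0.0
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


def mean_js_divergence(real, synth, bins=20):
    """Average the per-feature divergence over all columns."""
    return float(np.mean([js_divergence(real[:, j], synth[:, j], bins)
                          for j in range(real.shape[1])]))


def mean_corr_difference(real, synth):
    """Mean absolute difference between pairwise Pearson correlations."""
    c_real = np.corrcoef(real, rowvar=False)
    c_synth = np.corrcoef(synth, rowvar=False)
    iu = np.triu_indices_from(c_real, k=1)  # each feature pair counted once
    return float(np.mean(np.abs(c_real[iu] - c_synth[iu])))


# Toy stand-ins for a real and a synthetic feature matrix (rows = samples).
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6, 0.2],
                [0.6, 1.0, 0.4],
                [0.2, 0.4, 1.0]])
real_X = rng.multivariate_normal(np.zeros(3), cov, size=2000)
synth_X = rng.multivariate_normal(np.zeros(3), cov, size=2000)

jsd = mean_js_divergence(real_X, synth_X)
corr_diff = mean_corr_difference(real_X, synth_X)
```

Under these definitions both metrics are 0 when the synthetic data reproduce the real marginals and correlations exactly, so lower values indicate higher fidelity.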