Deep Survival Analysis from Adult and Pediatric Electrocardiograms: A Multi-center Benchmark Study

Platon Lukyanenko, Joshua Mayourian, Mingxuan Liu, John K Triedman, Sunil J Ghelani, William La Cava

Published: 17 Dec 2025, Last Modified: 14 Apr 2026 | BioData Mining | CC BY 4.0

Abstract:

Background: Artificial intelligence applied to electrocardiography (AI-ECG) has recently shown potential for mortality prediction, but heterogeneous approaches and private datasets have limited generalizable insights into which AI methodologies are fit for this purpose. To address this, we systematically evaluated model design choices across three large medical center cohorts: Beth Israel Deaconess (MIMIC-IV; n = 795,546 ECGs, United States), Telehealth Network of Minas Gerais (Code-15; n = 345,779, Brazil), and Boston Children’s Hospital (BCH; n = 255,379, United States).

Results: We comprehensively evaluated models to predict all-cause mortality, comparing horizon-based classification and deep survival methods across various neural architectures, including convolutional neural networks and transformers. We also benchmarked against demographic-only and gradient boosting baselines. Top models yielded good performance (median concordance, Code-15: 0.83; MIMIC-IV: 0.78; BCH: 0.81). Incorporating age and sex improved performance across all datasets. Classifier-Cox models exhibited site-dependent sensitivity to horizon choice (median Pearson’s R, Code-15: 0.35; MIMIC-IV: −0.71; BCH: 0.37). External validation reduced concordance, and in some cases demographic-only models outperformed externally trained AI-ECG models on Code-15. However, models trained on multi-site data outperformed site-specific models by margins ranging from 5% to 22%.

Conclusions: These findings highlight several key factors for robust AI-ECG deployment. Deep survival methods consistently provided advantages over horizon-based classifiers, while inclusion of demographic covariates such as age and sex improved predictive performance across sites. The sensitivity of classifier-based models to horizon selection underscores the need for site-specific calibration. The multi-site experiment reveals that cross-cohort training, even between adult and pediatric cohorts, can substantially improve performance on those cohorts compared with cohort-specific training. Together, these results emphasize the importance of model type, demographic features, and training data diversity in developing AI-ECG models that can be reliably applied across populations.
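
For readers unfamiliar with the two quantities the abstract leans on, the following is a minimal sketch, not the authors' implementation. It assumes the reported "concordance" is a Harrell-style concordance index and that a horizon-based classifier is trained on binary "died within the horizon" labels; the abstract confirms neither detail. The function names `concordance_index` and `horizon_label`, the skipping of tied follow-up times, and the treatment of subjects censored before the horizon as unlabeled are all illustrative assumptions.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell-style C-index (simplified sketch).

    times  : follow-up time per subject
    events : 1 if death was observed, 0 if censored
    risks  : predicted risk score (higher means earlier expected death)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let a be the subject with the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification (assumption): skip tied times. A pair is also not
        # comparable when the earlier subject was censored.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier death received the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def horizon_label(time, event, horizon):
    """Binary target for a horizon-based classifier (hypothetical helper).

    Returns 1 if death was observed by the horizon, 0 if the subject was
    followed past the horizon, and None if censored before the horizon
    (outcome unknown; dropping such subjects is one common choice).
    """
    if time <= horizon:
        return 1 if event else None
    return 0


# Toy data: follow-up in years, event indicator, model risk score.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.4, 0.1]

c_index = concordance_index(times, events, risks)
labels = [horizon_label(t, e, horizon=2.5) for t, e in zip(times, events)]
```

The contrast is that `horizon_label` depends on the chosen horizon and can discard subjects censored before it, whereas a survival objective uses every subject's follow-up time and event indicator directly. That dependence on the horizon is the kind of sensitivity the abstract reports for the Classifier-Cox models.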