CareBench: A Comprehensive Benchmark for Accuracy, Robustness, and Fairness in Multimodal Fusion of EHR and Chest X-Rays

ICLR 2026 Conference Submission 12952 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multimodal fusion, clinical multimodal learning, electronic health records, medical imaging
Abstract: Machine learning holds great promise for advancing clinical decision support, yet multimodal models remain difficult to translate into practice due to missing modalities and fairness concerns. We present CareBench, a comprehensive benchmark for evaluating accuracy, robustness, and fairness in multimodal fusion of Electronic Health Records (EHR) and chest X-rays (CXR), built on standardized cohorts from MIMIC-IV and MIMIC-CXR. CareBench provides an open-source data pipeline, a unified modeling framework spanning unimodal and multimodal methods, and a rigorous evaluation protocol that extends beyond predictive accuracy. Our analyses reveal several important findings: multimodal fusion improves accuracy when both modalities are complete, but the benefits shrink under realistic missingness unless architectures are explicitly designed to handle partial inputs; performance varies across tasks, metrics, and architectures, with robustness emerging as a design-dependent property; and multimodality can exacerbate fairness disparities, particularly across admission types and age groups. By establishing the first benchmark that jointly evaluates accuracy, robustness, and fairness for clinical multimodal learning, CareBench lays the foundation for developing methods that are not only accurate but also reliable and equitable in real-world healthcare settings.
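
To make the missingness point concrete, below is a minimal, illustrative sketch of the kind of partial-input-aware fusion the abstract alludes to. This is not the paper's implementation: PyTorch is assumed, and all names and dimensions (MaskedLateFusion, EHR_DIM, CXR_DIM, etc.) are hypothetical. It shows a late-fusion classifier that substitutes a learned placeholder embedding when the chest X-ray is absent.

    # Illustrative sketch (not from the paper) of a late-fusion model that
    # tolerates a missing imaging modality. All names and sizes are hypothetical.
    import torch
    import torch.nn as nn

    EHR_DIM, CXR_DIM, HIDDEN, N_CLASSES = 64, 512, 128, 2

    class MaskedLateFusion(nn.Module):
        def __init__(self):
            super().__init__()
            self.ehr_enc = nn.Sequential(nn.Linear(EHR_DIM, HIDDEN), nn.ReLU())
            self.cxr_enc = nn.Sequential(nn.Linear(CXR_DIM, HIDDEN), nn.ReLU())
            # Learned placeholder used when the CXR embedding is absent.
            self.cxr_missing = nn.Parameter(torch.zeros(HIDDEN))
            self.head = nn.Linear(2 * HIDDEN, N_CLASSES)

        def forward(self, ehr, cxr, cxr_present):
            # cxr_present: (batch,) float mask, 1.0 if the X-ray exists.
            h_ehr = self.ehr_enc(ehr)
            h_cxr = self.cxr_enc(cxr)
            mask = cxr_present.unsqueeze(-1)
            # Swap in the learned placeholder for missing studies.
            h_cxr = mask * h_cxr + (1.0 - mask) * self.cxr_missing
            return self.head(torch.cat([h_ehr, h_cxr], dim=-1))

    # Usage on a toy batch where the second patient lacks an X-ray.
    model = MaskedLateFusion()
    ehr = torch.randn(2, EHR_DIM)
    cxr = torch.randn(2, CXR_DIM)
    present = torch.tensor([1.0, 0.0])
    logits = model(ehr, cxr, present)  # shape: (2, N_CLASSES)

The learned placeholder lets the network degrade gracefully toward an EHR-only prediction rather than receiving all-zero image features it was never trained on, which is one simple way an architecture can be "explicitly designed to handle partial inputs."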
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 12952