A Fairness Audit of Medical Imaging Foundation Models on a Multimodal Structured Clinical Benchmark

Published: 23 May 2026, Last Modified: 23 May 2026, SD4H ICML 2026, CC BY 4.0
Keywords: Fairness in Machine Learning, Medical Imaging, Foundation Models, Multimodal Learning Healthcare, Bias & Disparities, Representation Learning, Adversarial Debiasing, Health Equity, AI Safety in Healthcare
TL;DR: We benchmark medical imaging foundation models for PE diagnosis, revealing spatial representation limits, negligible multimodal fusion gains, and severe age-related underdiagnosis that adversarial debiasing reduces with minimal performance trade-off.
Abstract: We conduct the first systematic fairness audit of three medical imaging foundation models (MedImageInsight, MedSigLIP, and BiomedCLIP) on INSPECT, a multimodal structured benchmark pairing CTPA imaging with longitudinal EHR for pulmonary embolism (PE) diagnosis and seven prognostic tasks. All three frozen encoders fall below the CT-LRCN baseline on PE diagnosis (AUROC 0.680–0.684 vs. 0.721). Our primary finding is that age is the dominant and previously unreported disparity dimension on INSPECT: patients aged 18–40 have underdiagnosis rates (UDR) of 0.63–0.80 versus 0.31–0.41 for ages 75–90, with MedSigLIP and BiomedCLIP reaching near-chance AUROC (0.508) for younger patients. This gap exceeds race/ethnicity and gender disparities and persists across all eight tasks. Age-targeted adversarial debiasing, the only strategy that reduces gaps without substantially hurting AUROC, cuts MedImageInsight’s age gap by 79% (0.333→0.069; p<0.001) at a cost of only 0.011 AUROC, establishing a practical mitigation path for high-capacity encoders.
Submission Number: 142
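
The underdiagnosis rate and age gap reported in the abstract can be illustrated with a minimal sketch. The abstract does not define either quantity, so this assumes UDR is the false negative rate among truly PE-positive patients within a subgroup, and that a gap is the difference between the highest and lowest subgroup rates. The function names and the toy labels below are hypothetical and are not taken from the paper or from INSPECT.

```python
from collections import defaultdict


def underdiagnosis_rates(y_true, y_pred, groups):
    """Per-group UDR, ASSUMED here to be missed positives / all positives."""
    positives = defaultdict(int)
    missed = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 0:
                missed[group] += 1
    # Groups with no positive cases are omitted to avoid dividing by zero.
    return {g: missed[g] / positives[g] for g in positives}


def udr_gap(rates):
    """ASSUMED gap definition: worst-group rate minus best-group rate."""
    return max(rates.values()) - min(rates.values())


def relative_reduction(before, after):
    """Fraction by which a gap shrinks after mitigation."""
    return (before - after) / before


# Toy labels (hypothetical, for illustration only).
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
groups = ["18-40"] * 5 + ["75-90"] * 5

rates = underdiagnosis_rates(y_true, y_pred, groups)
gap = udr_gap(rates)

# Arithmetic check on the abstract's reported gap change (0.333 -> 0.069).
reported_reduction = relative_reduction(0.333, 0.069)
print(rates, gap, round(reported_reduction, 2))
```

With the reported values, `relative_reduction(0.333, 0.069)` evaluates to roughly 0.79, consistent with the 79% figure in the abstract.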