Emergent Global OOD Performance in Multimodal Mammography Models

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Mammography, LMIC, Global Health, Fairness
TL;DR: As multimodal VLM mammography models scale (trained on EMBED, a very large, diverse US dataset), the largest models emergently perform better on 6 OOD evaluation sets from India, Vietnam, Saudi Arabia, China, Portugal, and the UK.
Abstract: Out-of-distribution (OOD) generalization is a central barrier to deploying mammography AI fairly across health systems, especially in LMICs. We test whether parameter scale alone, within one architecture family and without adding extra images, yields emergent robustness in a CLIP-trained vision–language model (VLM). Using EMBED-open (22k patients; 480k 2D views), we train on cohort-1 only (11k patients; 240k images) a CLIP VLM whose ViT patch-16 image tower spans 6M, 22M, 86M, and 307M parameters; multimodality is limited to short, de-identified captions (sex, age, ethnicity, manufacturer, view) with a frozen biomedical text encoder, and downstream clinical targets are modeled via linear probes on learned embeddings. We evaluate zero-shot on five international OOD cohorts—DMID (India), VinDr-Mammo (Vietnam), KAU-BCMD (Saudi Arabia), CMMD (China), and INBreast (Portugal). In-domain performance increases roughly linearly with log-parameters, whereas OOD performance shows a modest upward trend with a pronounced knee at the largest scale on several cohorts; gains are most visible at the most clinically relevant low-FPR operating points (pAUC@0–10% and TPR@5% FPR). Achieved without extra data, task labels during training, or architectural changes, this pattern is consistent with emergent global OOD robustness from scale in multimodal mammography models. Consequently, parameter scale should be a first-class design lever for globally deployable VLMs, and downsizing via smaller models, pruning, or quantization should be accompanied by OOD monitoring across international cohorts to avoid eroding this scale-driven robustness.
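The low-FPR operating points named in the abstract (pAUC@0–10% FPR and TPR@5% FPR) can be computed directly from a classifier's scores. The sketch below, using only NumPy, is an illustrative reconstruction of these standard metrics, not the authors' evaluation code; function names and the linear interpolation at the FPR cutoff are our assumptions.

```python
import numpy as np

def roc_points(y_true, scores):
    # Sweep the decision threshold from high to low to trace the ROC curve,
    # returning (FPR, TPR) arrays that both start at (0, 0).
    order = np.argsort(-scores)
    y = np.asarray(y_true, dtype=float)[order]
    tps = np.cumsum(y)          # true positives at each threshold
    fps = np.cumsum(1.0 - y)    # false positives at each threshold
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def pauc_low_fpr(y_true, scores, max_fpr=0.10):
    # Partial AUC over FPR in [0, max_fpr], normalized so a perfect
    # classifier scores 1.0 in this region (pAUC@0-10% with the default).
    fpr, tpr = roc_points(y_true, scores)
    mask = fpr <= max_fpr
    # Add an interpolated point exactly at the FPR cutoff (assumption:
    # linear interpolation between adjacent ROC points).
    fpr_c = np.concatenate([fpr[mask], [max_fpr]])
    tpr_c = np.concatenate([tpr[mask], [np.interp(max_fpr, fpr, tpr)]])
    # Trapezoidal integration, then normalize by the region width.
    area = np.sum(np.diff(fpr_c) * (tpr_c[1:] + tpr_c[:-1]) / 2.0)
    return float(area / max_fpr)

def tpr_at_fpr(y_true, scores, target_fpr=0.05):
    # Sensitivity at a fixed specificity point (TPR@5% FPR by default).
    fpr, tpr = roc_points(y_true, scores)
    return float(np.interp(target_fpr, fpr, tpr))
```

Both metrics concentrate the comparison on the left edge of the ROC curve, which is where screening programs operate: a model with a slightly lower overall AUC can still dominate at clinically usable false-positive rates.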
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24631