Empirical Analysis of Scaling Vision Foundation Models for Chest X-rays

Published: 27 Mar 2025, Last Modified: 09 May 2025, MIDL 2025 Poster, CC BY 4.0
Keywords: Vision Foundation Models, Chest X-ray, self-supervised learning
Abstract: Recent advancements in multimodal transformers have shown remarkable success in computer vision and natural language tasks, yet their adaptation to the clinical world remains challenging. We introduce CXformer, a vision transformer adapted for chest X-ray analysis through a systematic investigation of architectural choices and training modifications to DINOv2. Our empirical results show that using registers in ViT training, centering the teacher model's softmax outputs, and optimizing the number of heads lead to better performance. The small version, CXformer(S) (22M parameters), achieves 83.28% mean AUROC on the CheXpert test set, surpassing the 80.46% baseline obtained with vanilla DINOv2 settings. Contrary to common assumptions, our larger model, CXformer(B), with 87M parameters shows similar performance at 84% mean AUROC on CheXpert, suggesting that training optimizations matter more than model size. Furthermore, compared to the current state-of-the-art RAD-DINO, our CXformer(B), with 46% less pretraining compute (in FLOPs), achieves an average AUROC of 87.93% (vs. 87.32% for RAD-DINO) on pathology image classification evaluated across three widely used CXR datasets: CheXpert, RSNA Pneumonia, and NIH CXR8. Beyond classification, CXformer also delivers competitive, and occasionally superior, performance in semantic segmentation and radiology report generation, underscoring its versatility. The CXformer base and small models can be found at https://huggingface.co/m42-health
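The abstract credits, among other DINOv2 training modifications, centering of the teacher model's softmax outputs. The snippet below is a minimal sketch of that DINO/DINOv2-style centering step, not the authors' released code; the tensor shapes, momentum, and temperature values are illustrative assumptions.

```python
# Minimal sketch (assumed values, not the authors' implementation) of
# DINO/DINOv2-style teacher softmax centering: an EMA "center" of teacher
# logits is subtracted before the sharpened softmax to discourage collapse.
import torch
import torch.nn.functional as F


class TeacherCentering:
    """Keeps an EMA center of teacher logits and subtracts it before softmax."""

    def __init__(self, out_dim: int, momentum: float = 0.9, temperature: float = 0.04):
        self.center = torch.zeros(1, out_dim)   # running center of teacher logits
        self.momentum = momentum                # EMA momentum for the center
        self.temperature = temperature          # teacher softmax temperature

    @torch.no_grad()
    def __call__(self, teacher_logits: torch.Tensor) -> torch.Tensor:
        # Center the logits, then sharpen with a low-temperature softmax.
        probs = F.softmax((teacher_logits - self.center) / self.temperature, dim=-1)
        # Update the center as an EMA of the current batch mean.
        batch_center = teacher_logits.mean(dim=0, keepdim=True)
        self.center = self.center * self.momentum + batch_center * (1 - self.momentum)
        return probs


# Usage: produce centered teacher targets for a batch of 8 crops
# with a hypothetical 65536-dimensional projection head.
centering = TeacherCentering(out_dim=65536)
teacher_targets = centering(torch.randn(8, 65536))
```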
Primary Subject Area: Foundation Models
Secondary Subject Area: Unsupervised Learning and Representation Learning
Paper Type: Methodological Development
Registration Requirement: Yes
Reproducibility: https://github.com/m42-health/CXformer
Visa & Travel: Yes
Latex Code: zip
Copyright Form: pdf
Submission Number: 155