PedCLIP: A Vision-Language Model for Pediatric X-Rays with Mixture of Body Part Experts

Ta Duc Huy, Abin Shoby, Sen Kim Tran, Yutong Xie, Qi Chen, Phi Le Nguyen, Akshay Gole, Lingqiao Liu, Antonios Perperidis, Mark Friswell, Rebecca Linke, Andrea Glynn, Minh-Son To, Anton van den Hengel, Johan Verjans, Zhibin Liao, Minh Hieu Phan

Published: 2025, Last Modified: 07 May 2026, Venue: MICCAI (5) 2025, License: CC BY-SA 4.0

Abstract: Vision-language models have demonstrated remarkable success in general medical image analysis, yet their application to pediatric imaging remains significantly underexplored. These models show limited performance on pediatric datasets, primarily due to domain gaps stemming from anatomical differences, lower radiation doses, and pediatric-specific diseases. To close this gap, we present the first pediatric vision-language pre-training framework, dubbed PedCLIP, trained on a comprehensive pediatric imaging dataset comprising 404,670 X-rays of pediatric patients across diverse anatomical regions. To address anatomical diversity, we introduce a Mixture of Body Part Experts design, in which each expert specializes in learning features from a distinct anatomical region. Experimental evaluation across eleven downstream tasks demonstrates that our model significantly outperforms current state-of-the-art vision-language models, achieving superior diagnostic accuracy in challenging pediatric conditions, including rare diseases such as pediatric inflammatory arthritis. Code is available at https://github.com/tadeephuy/PedCLIP
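
The idea of one expert per anatomical region can be illustrated with a toy sketch. It is not the PedCLIP implementation: the abstract does not say how images are routed to experts, so the hard routing by a known body-part label, the residual connection, the class name `BodyPartExperts`, and the body-part names are all assumptions made for illustration only.

```python
import random


class BodyPartExperts:
    """Toy mixture of experts with hard routing by body-part label (an assumption)."""

    def __init__(self, body_parts, dim, seed=0):
        rng = random.Random(seed)
        # One small dim x dim weight matrix per body part (hypothetical experts).
        self.experts = {
            part: [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
            for part in body_parts
        }

    def forward(self, features, body_part):
        """Apply the expert selected by the label, then add a residual connection."""
        weights = self.experts[body_part]
        projected = [sum(w * x for w, x in zip(row, features)) for row in weights]
        return [x + p for x, p in zip(features, projected)]


# Hypothetical body parts and a hypothetical 4-dimensional image feature vector.
moe = BodyPartExperts(["chest", "hand", "pelvis"], dim=4)
features = [1.0, 0.5, -0.5, 2.0]
chest_out = moe.forward(features, "chest")
hand_out = moe.forward(features, "hand")
```

Here the same feature vector yields different outputs depending on which expert processes it, which is the sense in which each expert can specialize. A learned gating network that mixes several experts would be an equally plausible reading of the abstract.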

External IDs: dblp:conf/miccai/HuySTXCNGLPFLGTHVLP25