General or Medical CLIP, Which one Shall We Choose?
Keywords: Foundation Models, Classification, Visual Question Answering, Image-to-Text Retrieval, Medical Applications
Abstract: Despite growing interest in multimodal deep learning for medical imaging, researchers and clinicians still lack a systematic understanding of when medical-domain vision–language models outperform general-domain counterparts. While several medical CLIP variants have been proposed, they are typically evaluated in isolation and on narrow tasks, leaving open questions about how pre-training data, downstream task, and fine-tuning strategy jointly affect performance. We systematically compare four CLIP-based vision–language models on three representative tasks: image classification, image-to-text retrieval (ROCO, radiology and non-radiology), and visual question answering. Across tasks, zero-shot performance is generally insufficient for clinical use, even for medically pre-trained models, confirming the need for task-specific fine-tuning. Medical-domain pre-training offers clear benefits in low-data regimes and for in-distribution modalities, but can underperform CLIP when downstream data deviate from the pre-training distribution. When sufficient labeled data are available, and especially under LoRA-based tuning, general-domain CLIP consistently matches or surpasses specialized medical models. VQA remains notably challenging, with none of the evaluated models achieving competitive results even after fine-tuning, suggesting that more advanced multimodal reasoning approaches are needed. Based on these findings, we provide recommendations for selecting and adapting vision–language models in clinical settings.
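To make the two adaptation regimes contrasted in the abstract concrete, the following is a minimal sketch (not the authors' code) of zero-shot CLIP classification and of attaching LoRA adapters for fine-tuning. The checkpoint name, prompts, image path, and LoRA hyperparameters are illustrative assumptions; the paper's exact setup is in the linked repository.

```python
# Minimal sketch: zero-shot CLIP classification vs. LoRA-based adaptation.
# Model name, prompts, image path, and LoRA settings are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # hypothetical input image
prompts = ["a chest X-ray showing pneumonia", "a normal chest X-ray"]

# Zero-shot: rank candidate text prompts by image-text similarity.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarities -> class probabilities
print(dict(zip(prompts, probs[0].tolist())))

# LoRA-based adaptation (assumes the `peft` library): only low-rank adapters on
# the attention projections are trained; the CLIP backbone weights stay frozen.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
lora_model = get_peft_model(model, lora_cfg)
lora_model.print_trainable_parameters()  # trains only a small fraction of parameters
```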
Primary Subject Area: Foundation Models
Secondary Subject Area: Transfer Learning and Domain Adaptation
Registration Requirement: Yes
Reproducibility: https://github.com/98haiting/General-or-Medical-CLIP-Which-one-Shall-We-Choose
Visa & Travel: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 129