Are Medical Vision–Language Foundation Models Ready for Dermatology?

Authors: ICLR 2026 Conference Submission675 Authors (anonymous)

Published: 01 Sept 2025 (modified: 23 Dec 2025) · License: CC BY 4.0
Keywords: Skin Imaging Analysis, Machine Learning for Healthcare, Medical Foundation Models
Abstract: Medical vision-language models (VLMs) show significant promise for clinical image understanding, with the potential to improve medical accessibility and interpretability. However, a critical gap in diagnostic accuracy exists between their strong vision encoders and the full multimodal models built on them: the vision encoder alone can outperform the complete VLM. This gap suggests that such VLMs fail to make full use of the strength of their vision branch. The misalignment also implies that these models over-rely on their language priors, producing plausible-sounding diagnoses without sufficiently grounding their reasoning in visual evidence. Focusing on dermatology, we systematically investigate the root causes of this phenomenon. While fine-tuning can improve accuracy, it often compromises the model's essential reasoning capabilities. To address these challenges, we introduce a training-free inference pipeline that closes the performance gap while preserving the model's reasoning abilities, enhancing diagnostic accuracy and faithfulness at inference time alone. These strategies are readily extensible, suggesting a path toward more reliable and interpretable VLMs in medicine and beyond.
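To make the idea of a training-free, vision-grounded pipeline concrete, here is a minimal Python sketch of one plausible instantiation, not the submission's actual method: a CLIP-style vision encoder first produces a zero-shot shortlist of diagnoses, and the language side is then prompted to choose only from that shortlist, so its answer stays tied to visual evidence. The checkpoint, class list, and prompt template below are all illustrative assumptions.

```python
# Hedged sketch of a "vision-first" training-free inference pipeline.
# Assumptions (not from the paper): the generic openai/clip-vit-base-patch32
# checkpoint stands in for a dermatology-specific vision encoder, and the
# diagnosis labels and prompt wording are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

CLASSES = ["melanoma", "basal cell carcinoma", "benign nevus", "seborrheic keratosis"]

def topk_candidates(image: Image.Image, k: int = 3) -> list[tuple[str, float]]:
    """Zero-shot diagnosis scores from the vision branch alone."""
    prompts = [f"a dermoscopic image of {c}" for c in CLASSES]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    scores, idx = probs.topk(k)
    return [(CLASSES[i], s.item()) for i, s in zip(idx.tolist(), scores)]

def grounded_prompt(candidates: list[tuple[str, float]]) -> str:
    """Constrain the language side to the encoder's shortlist, countering
    over-reliance on language priors."""
    listing = "\n".join(f"- {name} (visual score {p:.2f})" for name, p in candidates)
    return (
        "The vision encoder ranked these diagnoses for the attached image:\n"
        f"{listing}\n"
        "Choose ONE of the listed diagnoses and justify it from visible features."
    )

image = Image.open("lesion.jpg")  # any dermoscopic image
print(grounded_prompt(topk_candidates(image)))
# The resulting prompt, together with the image, is then passed to the
# full multimodal model; no weights are updated at any point.
```

Because the pipeline only reorders and constrains inference, it requires no fine-tuning, which is consistent with the abstract's claim that reasoning ability is preserved while diagnostic accuracy improves.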
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 675