Multimodal multitask similarity learning for vision language model on radiological images and reports

Published: 17 Mar 2025 · Last Modified: 12 Nov 2025 · OpenReview Archive Direct Upload · License: CC BY 4.0
Abstract: In recent years, large-scale Vision-Language Models (VLMs) have shown promise in learning general representations for various medical image analysis tasks. However, current medical VLM methods typically employ contrastive learning approaches that have limited ability to capture nuanced yet crucial medical knowledge, particularly among similar medical images, and they do not explicitly account for the uneven and complementary semantic information carried by the different modalities. To address these challenges, we propose a novel Multimodal Multitask Similarity Learning (M2SL) method that learns joint representations of image-text pairs and captures the relational similarity between modalities via a coupling network. Our method also leverages the rich information in the text inputs to construct a knowledge-driven semantic similarity matrix as the supervision signal. We conduct extensive experiments on cross-modal retrieval and zero-shot classification tasks with radiological images and reports, demonstrating substantial performance gains over existing methods. Our method also accommodates low-resource settings with limited training data and has significant implications for advancing VLM development.
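To make the supervision idea concrete, the sketch below illustrates one plausible way a text-derived semantic similarity matrix could replace the one-hot targets of standard contrastive learning. This is a minimal, hypothetical example, not the paper's implementation: the function name `similarity_supervised_loss`, the temperature value, and the construction of `text_sim` are all assumptions, and the actual M2SL objective and coupling network may differ substantially.

```python
import torch
import torch.nn.functional as F


def similarity_supervised_loss(img_emb, txt_emb, text_sim, temperature=0.07):
    """Cross-modal loss supervised by a text-derived similarity matrix (hypothetical sketch).

    img_emb, txt_emb: (N, d) L2-normalized embeddings from the image and text encoders.
    text_sim: (N, N) knowledge-driven semantic similarity matrix built from the reports
              (placeholder; the paper's exact construction is not specified here).
    """
    # Cross-modal similarity logits between all image-text pairs in the batch.
    logits = img_emb @ txt_emb.t() / temperature

    # Soft targets derived from the text-side similarity matrix, replacing
    # the one-hot targets used in standard contrastive (e.g., CLIP-style) loss.
    targets_i2t = F.softmax(text_sim / temperature, dim=-1)
    targets_t2i = F.softmax(text_sim.t() / temperature, dim=-1)

    # Symmetric image-to-text and text-to-image terms with soft probability targets.
    loss_i2t = F.cross_entropy(logits, targets_i2t)
    loss_t2i = F.cross_entropy(logits.t(), targets_t2i)
    return 0.5 * (loss_i2t + loss_t2i)


# Toy usage with random embeddings and a placeholder text-similarity matrix.
N, d = 8, 128
img = F.normalize(torch.randn(N, d), dim=-1)
txt = F.normalize(torch.randn(N, d), dim=-1)
sim = txt @ txt.t()  # stand-in for a knowledge-driven report-similarity matrix
print(similarity_supervised_loss(img, txt, sim).item())
```

The soft targets let semantically related image-report pairs within a batch receive partial credit rather than being treated as pure negatives, which is the general intuition behind supervising with a similarity matrix instead of identity labels.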