Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment

Huimin Yan, Xian Yang, Liang Bai, Jiamin Li, Jiye Liang

Published: 01 Jan 2025, Last Modified: 25 Jan 2026. Venue: IEEE Transactions on Multimedia. License: CC BY-SA 4.0

Abstract: The increasing interest in learning from paired medical images and textual reports highlights the need for methods that can achieve multi-grained alignment between these two modalities. However, most existing approaches overlook fine-grained semantic alignment, which can constrain the quality of the generated representations. To tackle this problem, we propose the Multi-Grained Vision-and-Language Alignment (MGVLA) model, which effectively leverages multi-grained correspondences between medical images and texts at different levels, including disease, instance, and token levels. For disease-level alignment, our approach adopts the concept of contrastive learning and uses medical terminologies detected from textual reports as soft labels to guide the alignment process. At the instance level, we propose a strategy for sampling hard negatives, where images and texts with the same disease type but differing in details such as disease locations and severity are considered as hard negatives. This strategy helps our approach to better distinguish between positive and negative image-text pairs, ultimately enhancing the quality of our learned representations. For token-level alignment, we employ a masking and recovery technique to achieve fine-grained semantic alignment between patches and sub-words. This approach effectively aligns the different levels of granularity between the image and language modalities. To assess the efficacy of our MGVLA model, we conduct comprehensive experiments on the image-text retrieval and phrase grounding tasks.
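The instance-level hard-negative idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each image-report pair carries a single disease label, and that any other pair in the batch with the same label differs in details such as location or severity. The function name `hard_negative_indices` and the example labels are hypothetical.

```python
from typing import Dict, List, Sequence


def hard_negative_indices(disease_labels: Sequence[str]) -> Dict[int, List[int]]:
    """Map each sample index to the other samples that share its disease label.

    Assumption: same-label samples differ in details (location, severity),
    so they can serve as hard negatives for one another.
    """
    # Group sample indices by disease label.
    by_label: Dict[str, List[int]] = {}
    for idx, label in enumerate(disease_labels):
        by_label.setdefault(label, []).append(idx)

    # For each sample, keep the same-label samples other than itself.
    return {
        idx: [j for j in by_label[label] if j != idx]
        for idx, label in enumerate(disease_labels)
    }


# Hypothetical batch: one disease label per image-report pair.
labels = ["pneumonia", "edema", "pneumonia", "pneumonia", "effusion"]
negatives = hard_negative_indices(labels)
```

In this sketch, a sample whose disease label is unique in the batch gets an empty list, so a training loop would need to fall back to ordinary in-batch negatives for it.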

External IDs: doi:10.1109/tmm.2025.3590930