Rethinking Contrastive Language-Image Pre-Training for Medical Cross-Modal Retrieval: Beyond One-to-One Correspondence

ICLR 2026 Conference Submission 17308 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: cross-modal retrieval, CLIP, contrastive learning, label smoothing
Abstract: Cross-modal retrieval using contrastive language-image pre-training (CLIP) has achieved remarkable success, including in medical applications. While effective, current CLIP-based approaches for medical image-report retrieval overlook critical differences between medical and natural image-text pairs. Unlike concise natural image captions, medical reports are long, multi-faceted descriptions of their paired images. Furthermore, similar pathological patterns frequently recur across different medical cases. These characteristics challenge CLIP's one-to-one image-text alignment paradigm, which struggles with lengthy reports and ignores inter-case similarities. To address these limitations, we propose two innovations: HIP-InfoNCE, a contrastive loss that aligns holistic images with multiple stochastic masked views of their corresponding reports, and text-aware label smoothing, which incorporates inter-report semantic similarity into the supervision. Extensive experiments demonstrate that our approach outperforms existing methods by significant margins and achieves state-of-the-art performance.
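To make the abstract's two components concrete, here is a minimal PyTorch sketch of how such a training objective could be assembled. The paper's actual formulation is not given on this page, so everything below is an illustrative assumption: the function names, the smoothing weight `alpha`, the temperature `tau`, and the choice to average the loss over K masked report views are all hypothetical.

```python
import torch
import torch.nn.functional as F

def text_aware_smoothed_targets(report_emb, alpha=0.1, tau=0.07):
    """Soft retrieval targets: one-hot labels smoothed toward
    inter-report semantic similarity (illustrative assumption).

    report_emb: (B, D) L2-normalized report embeddings.
    alpha: hypothetical smoothing weight; tau: temperature.
    """
    B = report_emb.size(0)
    sim = report_emb @ report_emb.t() / tau        # pairwise report similarity
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarity
    soft = F.softmax(sim, dim=-1)                  # mass spread over similar reports
    hard = torch.eye(B, device=report_emb.device)
    return (1.0 - alpha) * hard + alpha * soft

def hip_infonce_sketch(img_emb, masked_view_embs, targets, tau=0.07):
    """InfoNCE-style loss averaged over stochastic masked report views
    (a sketch of the idea, not the authors' implementation).

    img_emb: (B, D); masked_view_embs: (K, B, D); all L2-normalized.
    targets: (B, B) soft targets from text_aware_smoothed_targets.
    """
    loss = 0.0
    for view_emb in masked_view_embs:              # each view: (B, D)
        logits = img_emb @ view_emb.t() / tau      # image-to-report similarities
        log_probs = F.log_softmax(logits, dim=-1)
        loss = loss - (targets * log_probs).sum(dim=-1).mean()
    return loss / masked_view_embs.size(0)
```

Under these assumptions, each holistic image embedding is pulled toward several randomly masked encodings of its own report, while the smoothed targets keep semantically similar reports from being treated as hard negatives.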
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17308