RaceCLIP: Enhancing medical vision-language representation learning via retrieval augmented caption enrichment
Keywords: CLIP, RAG
Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated strong potential in learning transferable visual models by aligning paired images and textual descriptions. However, the quality of training data remains a significant bottleneck. In many real-world scenarios, image-text pairs are noisy or accompanied by captions that are too short or generic to convey key visual attributes. For example, in medical imaging, most available data come from illustrative figures in the public literature rather than detailed clinical reports, resulting in captions that lack the precision and context provided by expert annotations. Recent efforts to improve caption quality with Large Language Models (LLMs) have largely focused on natural images and overlooked the integration of domain-specific knowledge. In this study, we propose a Retrieval-Augmented Generation (RAG) framework guided by expert semantic knowledge to enrich image captions in the medical context. We further introduce a multi-text training strategy that effectively incorporates these enriched descriptions into CLIP training. Our approach, demonstrated in the medical domain as a proof of concept, achieves state-of-the-art performance on multiple downstream tasks, highlighting its broader potential for vision-language pretraining in specialized domains. Our code is available at https://anonymous.4open.science/r/RaceCLIP-D4C5.
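To make the multi-text training idea concrete, below is a minimal, hypothetical sketch of a CLIP-style contrastive loss that accepts several captions per image (e.g., the original caption plus RAG-enriched variants). The function name, the soft multi-positive averaging scheme, and all tensor shapes are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch (not the paper's implementation): a CLIP-style contrastive
# loss where each image is paired with K captions (original + enriched).
import torch
import torch.nn.functional as F


def multi_caption_clip_loss(image_emb, text_emb, temperature=0.07):
    """
    image_emb: (B, D) image embeddings
    text_emb:  (B, K, D) K caption embeddings per image (original + enriched)
    """
    B, K, D = text_emb.shape
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Image -> text: similarity of each image against every caption in the batch, (B, B*K).
    logits_i2t = image_emb @ text_emb.reshape(B * K, D).t() / temperature

    # Each image has K positives (its own captions); spread the target mass over them.
    targets = torch.zeros(B, B * K, device=logits_i2t.device)
    for i in range(B):
        targets[i, i * K:(i + 1) * K] = 1.0 / K
    loss_i2t = -(targets * F.log_softmax(logits_i2t, dim=-1)).sum(dim=-1).mean()

    # Text -> image: each caption's single positive is its source image.
    logits_t2i = text_emb.reshape(B * K, D) @ image_emb.t() / temperature
    labels_t2i = torch.arange(B, device=logits_t2i.device).repeat_interleave(K)
    loss_t2i = F.cross_entropy(logits_t2i, labels_t2i)

    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    B, K, D = 4, 3, 512
    loss = multi_caption_clip_loss(torch.randn(B, D), torch.randn(B, K, D))
    print(loss.item())
```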
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18252