RaceCLIP: Enhancing medical vision-language representation learning via retrieval augmented caption enrichment
Keywords: CLIP, RAG
Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated strong potential in learning transferable visual models by aligning paired images and textual descriptions. However, the quality of training data remains a significant bottleneck. In many real-world scenarios, image-text pairs are noisy or accompanied by captions that are too short or generic to convey key visual attributes. For example, in medical imaging, most available data come from illustrative figures in the public literature rather than detailed clinical reports, resulting in captions that lack the precision and context provided by expert annotations. Recent efforts to improve caption quality with Large Language Models (LLMs) have largely focused on natural images and overlooked the integration of domain-specific knowledge. In this study, we propose a Retrieval-Augmented Generation (RAG) framework guided by expert semantic knowledge to enrich image captions in the medical context. We further introduce a multi-text training strategy that effectively incorporates these enriched descriptions into CLIP training. Our approach, demonstrated in the medical domain as a proof of concept, achieves state-of-the-art performance on multiple downstream tasks, highlighting its broader potential for vision-language pretraining in specialized domains. Our code is available at https://anonymous.4open.science/r/RaceCLIP-D4C5.
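To make the multi-text training idea concrete, below is a minimal, hypothetical sketch of a CLIP-style contrastive loss that accepts several captions per image (e.g., the original caption plus RAG-enriched variants). The function name, the soft multi-positive averaging scheme, and all tensor shapes are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch (not the paper's implementation): a CLIP-style contrastive
# loss where each image is paired with K captions (original + enriched).
import torch
import torch.nn.functional as F


def multi_caption_clip_loss(image_emb, text_emb, temperature=0.07):
    """
    image_emb: (B, D) image embeddings
    text_emb:  (B, K, D) K caption embeddings per image (original + enriched)
    """
    B, K, D = text_emb.shape
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Image -> text: similarity of each image against every caption in the batch, (B, B*K).
    logits_i2t = image_emb @ text_emb.reshape(B * K, D).t() / temperature

    # Each image has K positives (its own captions); spread the target mass over them.
    targets = torch.zeros(B, B * K, device=logits_i2t.device)
    for i in range(B):
        targets[i, i * K:(i + 1) * K] = 1.0 / K
    loss_i2t = -(targets * F.log_softmax(logits_i2t, dim=-1)).sum(dim=-1).mean()

    # Text -> image: each caption's single positive is its source image.
    logits_t2i = text_emb.reshape(B * K, D) @ image_emb.t() / temperature
    labels_t2i = torch.arange(B, device=logits_t2i.device).repeat_interleave(K)
    loss_t2i = F.cross_entropy(logits_t2i, labels_t2i)

    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    B, K, D = 4, 3, 512
    loss = multi_caption_clip_loss(torch.randn(B, D), torch.randn(B, K, D))
    print(loss.item())
```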
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18252