Semantic-Aware Hard Negative Mining for Medical Vision-Language Contrastive Pretraining

Yongxin Li, Ying Cheng, Yaning Pan, Wen He, Qing Wang, Rui Feng, Xiaobo Zhang

Published: 27 Oct 2025, Last Modified: 13 Nov 2025. License: CC BY-SA 4.0
Abstract: Existing medical vision-language contrastive pretraining methods aim to pull paired image-report embeddings close together while pushing unpaired ones apart. However, medical images often exhibit high inter-class visual similarity with only subtle differences. This gives rise to hard negative samples that are semantically distinct from the anchor yet lie incorrectly close to it in the embedding space, making semantically dissimilar samples difficult to separate. Previous methods rely only on embedding similarity to identify hard negatives and therefore often mistake false negatives for hard negatives. To address this issue, we design a simple yet effective approach called Semantic-Aware Hard Negative mining (SAHN), which distinguishes hard negatives from false negatives and encourages the model to pay greater attention to the former. Specifically, hard negatives are identified as samples with high embedding similarity but low semantic similarity to the anchor and are assigned greater importance weights. By integrating these importance weights into the InfoNCE loss, SAHN enhances the model's ability to separate semantically dissimilar samples while clustering semantically similar ones. We further provide a gradient-based theoretical analysis to validate the effectiveness of SAHN. Extensive experiments on four downstream medical tasks covering image classification, object detection, semantic segmentation, and cross-modal retrieval demonstrate the superiority of our approach.
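
The abstract does not give the exact weighting formula, but the mechanism it describes (up-weighting negatives that are close in embedding space yet semantically dissimilar, while not up-weighting likely false negatives) can be illustrated with a minimal PyTorch sketch. The names `sem_sim` and `alpha` and the specific weight expression below are illustrative assumptions, not the authors' formulation; `sem_sim` stands in for any batch-level semantic similarity matrix, e.g. one derived from report-level labels.

```python
import torch
import torch.nn.functional as F

def sahn_info_nce(img_emb, txt_emb, sem_sim, tau=0.07, alpha=1.0):
    """Sketch of a semantic-aware weighted InfoNCE loss.

    img_emb, txt_emb: (B, D) image and report embeddings.
    sem_sim: (B, B) semantic similarity in [0, 1] between each image i
             and report j (assumed to be given, e.g. from report labels).
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                    # (B, B) scaled similarities

    B = logits.size(0)
    pos_mask = torch.eye(B, dtype=torch.bool, device=logits.device)

    # Hypothetical importance weights: largest when embedding similarity is
    # high but semantic similarity is low (hard negatives); close to 1 when
    # semantic similarity is high (likely false negatives) or alpha = 0.
    emb_sim = (img_emb @ txt_emb.t()).detach()
    weights = 1.0 + alpha * emb_sim.clamp(min=0) * (1.0 - sem_sim)
    weights = weights.masked_fill(pos_mask, 1.0)            # positives keep weight 1

    # Weighted softmax denominator: each negative contributes w_ij * exp(logit_ij).
    exp_logits = torch.exp(logits) * weights
    log_prob = logits.diag() - torch.log(exp_logits.sum(dim=1))
    return -log_prob.mean()
```

With `alpha = 0` this reduces to the standard InfoNCE loss, which makes the semantic-aware weighting straightforward to ablate.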