Learning from Inside: Self-driven Intra-modality Siamese Knowledge Generation and Inter-modality Alignment for Chest X-rays Vision-Language Pre-training

Published: 01 Jan 2024 · Last Modified: 06 Mar 2025 · BIBM 2024 · CC BY-SA 4.0
Abstract: Pathology typically occupies only a small portion of a chest X-ray, so much of the image may be irrelevant to the paired radiology report; the Chest X-rays Report Understanding (CRU) task therefore focuses on exploiting these small case regions to improve medical vision-language pre-training (VLP). However, existing studies neglect fine-grained false negative samples in medical visual representations and consequently perform poorly in CRU scenarios, which we attribute to a fine-grained feature collapse problem. To address this issue, we propose an intra-modality siamese knowledge generation and inter-modality alignment framework, termed the Chest X-rays Report Understanding Framework (CRUF). CRUF leverages the siamese knowledge in image-text pairs as a guiding signal to distinguish fine-grained false negative samples from true negatives within each modality, and further narrows the distance between false negative and positive samples across modalities, accurately aligning the case regions of each image with the corresponding medical terms. Experimental results on multiple downstream medical imaging datasets, covering image classification, object detection, and semantic segmentation, demonstrate the stability and strong performance of our framework. Code is available at https://github.com/cl-red/CRUF.
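The core idea the abstract describes, treating detected fine-grained false negatives differently from true negatives during cross-modal alignment, can be illustrated with a minimal sketch. The function below is an assumption for illustration only (it is not CRUF's actual objective; the function name, shapes, and the `fn_mask` input are hypothetical): it computes a standard InfoNCE-style image-to-text contrastive loss, but excludes pairs flagged as false negatives from the negative set so they are no longer pushed apart.

```python
import numpy as np

def info_nce_excluding_false_negatives(img_emb, txt_emb, fn_mask, temperature=0.07):
    """Illustrative contrastive loss: pairs marked in fn_mask (hypothetical
    false-negative detections) are removed from the negative set instead of
    being repelled. Not the paper's exact formulation.
    img_emb, txt_emb: (N, D) paired embeddings; fn_mask: (N, N) boolean."""
    # L2-normalize embeddings so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix
    n = logits.shape[0]
    # Mask flagged false negatives off the negative set; keep the diagonal
    # (the matched image-report pairs) intact.
    neg_mask = fn_mask & ~np.eye(n, dtype=bool)
    logits = np.where(neg_mask, -np.inf, logits)  # exp(-inf) -> 0 in softmax
    # Cross-entropy toward the matched pair on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Because masking removes a competing term from each softmax denominator, flagging a pair as a false negative can only keep the loss the same or lower it, which mirrors the stated goal of not penalizing semantically matching but unpaired samples.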