Cross-modal retrieval of chest X-ray images and diagnostic reports based on report entity graph and dual attention

Published: 01 Jan 2025 · Last Modified: 23 Apr 2025 · Multim. Syst. 2025 · CC BY-SA 4.0
Abstract: Cross-modal retrieval for chest X-ray images and diagnostic reports is the automated process of fetching reports or related images from an extensive medical records database using specific queries. Existing methods for cross-modal retrieval of chest X-ray images and diagnostic reports often struggle to align fine-grained semantic representations between the two modalities. Moreover, they make poor use of the structured information in reports, which hinders interpretable and effective alignment of image and text representations and degrades retrieval performance. In this paper, we propose a novel framework for cross-modal retrieval of chest X-ray images and diagnostic reports based on a report entity graph and dual attention mechanisms. Specifically, we first employ an X-ray image encoder to extract fine-grained visual semantic representations of the X-ray images, and a report encoder to extract text features and anatomical entity features from the diagnostic reports. A Graph Convolutional Network (GCN) is then applied to the report entity graph to capture the semantic relationships among report entities. To simulate the different levels of attention radiologists pay when reading chest X-ray images and reports, we design a dual attention mechanism consisting of intra-attention and inter-attention. The intra-attention mechanism fuses global features with local representations that capture fine-grained structures and rich details in images and reports, respectively. The inter-attention mechanism establishes accurate connections between the medical report entities and the fine-grained representations of the chest radiograph images. Our method achieves improvements of approximately 6.2%, 3.9%, and 4.9% on the report retrieval task and 7.4%, 4.0%, and 5.5% on the X-ray image retrieval task over the recent HSR method on three widely used chest radiograph datasets.
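To make the dual-attention idea concrete, below is a minimal PyTorch sketch of how the two mechanisms could be realized: intra-attention uses a modality's global feature as a query over its own local features, while inter-attention lets report entity features (e.g., the GCN outputs) attend over fine-grained image region features. All module names, tensor shapes, scaled dot-product formulation, and the residual fusion are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; the paper's actual architecture may differ.
import torch
import torch.nn as nn


class IntraAttention(nn.Module):
    """Fuse a global feature with local (fine-grained) features of the
    same modality, using the global vector as the attention query."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feat: (B, D); local_feats: (B, N, D)
        q = self.query(global_feat).unsqueeze(1)           # (B, 1, D)
        k = self.key(local_feats)                          # (B, N, D)
        v = self.value(local_feats)                        # (B, N, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # (B, 1, N)
        attended = (attn @ v).squeeze(1)                   # (B, D)
        return global_feat + attended                      # residual fusion (assumed)


class InterAttention(nn.Module):
    """Cross-modal attention: report entity features attend over
    fine-grained image region features to form aligned representations."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, entity_feats: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # entity_feats: (B, E, D), e.g. GCN outputs over the report entity graph;
        # region_feats: (B, R, D) from the X-ray image encoder.
        q = self.query(entity_feats)                       # (B, E, D)
        k = self.key(region_feats)                         # (B, R, D)
        v = self.value(region_feats)                       # (B, R, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # (B, E, R)
        return attn @ v                                    # (B, E, D) image-grounded entities


if __name__ == "__main__":
    B, N, R, E, D = 2, 49, 49, 8, 256                      # hypothetical sizes
    fused_global = IntraAttention(D)(torch.randn(B, D), torch.randn(B, N, D))
    aligned = InterAttention(D)(torch.randn(B, E, D), torch.randn(B, R, D))
    print(fused_global.shape, aligned.shape)               # (2, 256) and (2, 8, 256)
```

Under these assumptions, the intra-attention output gives each modality a detail-aware global embedding, and the inter-attention output grounds each report entity in the image regions it most plausibly describes, which is the alignment the retrieval objective would then exploit.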