CMT: Cross-modal Memory Transformer for Medical Image Report Generation

Published: 01 Jan 2023 · Last Modified: 13 Nov 2025 · DASFAA (3) 2023 · CC BY-SA 4.0
Abstract: Automatic medical image report generation has attracted extensive research interest in medical data mining, as it effectively alleviates doctors’ workload and improves report standardization. Mainstream approaches adopt a Transformer-based encoder-decoder architecture to align visual and linguistic features. However, they rarely consider the importance of cross-modal interaction (e.g., the interaction between images and reports) and do not adequately explore the relations between multi-modal medical data, leading to inaccurate and incoherent reports. To address these issues, we propose a Cross-modal Memory Transformer (CMT) that processes multi-modal medical data (i.e., medical images, medical terminology knowledge, and medical report text) and leverages the relations between these modalities to generate accurate medical reports. To capture cross-modal interactions, we design a novel cross-modal feature memory decoder that memorizes the relations between image and report features. Furthermore, a multi-modal feature fusion module adaptively measures the contribution of each modality to word generation, which improves the accuracy of the generated reports. Extensive experiments on three real-world datasets demonstrate that CMT outperforms benchmark methods on automatic evaluation metrics.
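To make the fusion idea concrete, below is a minimal PyTorch sketch of an adaptive multi-modal fusion step of the kind the abstract describes: at each decoding step, learned gates weight the visual, terminology, and report-text features before word prediction. This is not the authors’ released implementation; the class name `MultiModalFusion`, the gating design, and all dimensions are illustrative assumptions.

```python
# Hypothetical sketch of adaptive multi-modal feature fusion.
# Assumption: visual, terminology, and text features have already been
# projected to a shared hidden size d_model by upstream encoders.
import torch
import torch.nn as nn


class MultiModalFusion(nn.Module):
    """Adaptively weight visual, terminology, and text features."""

    def __init__(self, d_model: int):
        super().__init__()
        # One scalar gate per modality, conditioned on the concatenation
        # of the three modality features at the current decoding step.
        self.gate = nn.Sequential(
            nn.Linear(3 * d_model, 3),
            nn.Softmax(dim=-1),
        )

    def forward(self, visual, term, text):
        # visual, term, text: (batch, seq_len, d_model)
        weights = self.gate(torch.cat([visual, term, text], dim=-1))
        # weights: (batch, seq_len, 3); one normalized weight per modality.
        w_v, w_k, w_t = weights.unbind(dim=-1)
        fused = (w_v.unsqueeze(-1) * visual
                 + w_k.unsqueeze(-1) * term
                 + w_t.unsqueeze(-1) * text)
        # fused: (batch, seq_len, d_model), fed to the word classifier.
        return fused


if __name__ == "__main__":
    fusion = MultiModalFusion(d_model=512)
    b, t, d = 2, 10, 512
    out = fusion(torch.randn(b, t, d), torch.randn(b, t, d), torch.randn(b, t, d))
    print(out.shape)  # torch.Size([2, 10, 512])
```

Because the gate weights are normalized per decoding step, the model can lean on terminology knowledge for clinical terms and on visual features for findings grounded in the image, which is one plausible reading of "adaptively measure the contribution of multi-modal features for word generation".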