Abstract: Automatic generation of radiology reports, which aims to produce textual descriptions of radiology images, has become an important task in medical AI. However, existing methods tend to overlook core elements of radiologists' diagnostic workflow: they struggle to establish relevance between images and clinical history and to structurally disentangle disease semantics into diagnostic concepts and their manifestations, which compromises clinical coherence and diagnostic precision. To address this, we propose the Multi-Modality and Multi-Grained Transformer (MMMGT), a framework that embeds clinical reasoning patterns from the radiologist's diagnostic workflow into both the encoding and decoding processes. (1) In the encoding process, the Multi-Modality Semantic Encoder (MMSE) integrates visual features with clinical history embeddings via cross-modal attention, dynamically adjusting attention weights to focus on abnormal regions and mitigate visual bias. (2) In the decoding process, the Multi-Grained Semantic Decoder (MGSD) generates structured topic–state pairs to mimic the hierarchical diagnostic process and incorporates a clinical alignment signal to enhance consistency with disease labels. Extensive evaluations on the IU-Xray and MIMIC-CXR datasets demonstrate the effectiveness of MMMGT in achieving clinical coherence and diagnostic precision: it attains state-of-the-art results with 0.520 BLEU-1 and 0.445 ROUGE-L (reflecting clinical narrative consistency) on IU-Xray, and a 0.411 F1 score (indicating diagnostic accuracy) for clinical efficacy on MIMIC-CXR. Ablation studies validate the contribution of each module.
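The cross-modal fusion step described for the MMSE can be illustrated with a minimal sketch of scaled dot-product cross-attention, where visual patch features act as queries over clinical-history token embeddings. This is an illustrative assumption about the general mechanism (the paper's actual architecture, dimensions, and weighting scheme are not specified in the abstract); the function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, history):
    """Fuse visual patch features (queries) with clinical-history
    token embeddings (keys/values) via scaled dot-product attention.

    visual:  (num_patches, d) image-side features
    history: (num_tokens, d)  clinical-history embeddings
    Returns history-conditioned visual features of shape (num_patches, d).
    """
    d = visual.shape[-1]
    scores = visual @ history.T / np.sqrt(d)   # (num_patches, num_tokens)
    weights = softmax(scores, axis=-1)         # attention over history tokens
    return weights @ history

# Toy example: 4 visual patches, 6 history tokens, 8-dim embeddings
rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))
h = rng.standard_normal((6, 8))
fused = cross_modal_attention(v, h)  # shape (4, 8)
```

In a full model, the attention weights would additionally be modulated (e.g., by learned projections and abnormality cues) to emphasize abnormal regions, as the abstract describes.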
External IDs: dblp:conf/icic/LiZZZFL25