Optimizing medical image report generation through a discrete diffusion framework

Published: 01 Jan 2025 · Last Modified: 11 Jun 2025 · J. Supercomput. 2025 · CC BY-SA 4.0
Abstract: Medical imaging, such as X-rays and CT scans, plays a critical role in diagnostics, yet the growing workload leads to reporting delays and potential errors. Traditional deep learning-based approaches often struggle to capture the complex semantic relationships in long medical reports, producing text that lacks coherence and consistency. To address these challenges, we propose a novel multi-stage generative framework based on diffusion models. A cross-attention mechanism attends simultaneously to textual and visual features, significantly improving the model’s ability to align image content with accurate textual descriptions. Additionally, we optimize multimodal information fusion by integrating skip connections, Long Short-Term Memory (LSTM) networks, and MIX-MLP networks, enhancing the flow of information between modalities. Together, these fusion mechanisms improve the accuracy and coherence of automatic report generation. The model was evaluated on the IU-XRAY and MIMIC-CXR datasets, achieving state-of-the-art performance across multiple metrics, including BLEU, METEOR, and ROUGE, and significantly surpassing prior methods. These results validate the framework’s effectiveness in generating professional and coherent medical reports, offering a reliable solution to alleviate the burden of manual reporting. The source code is available at https://github.com/watersunhznu/DifMIRG.
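The core idea of the cross-attention step described above can be sketched as follows. This is a minimal, simplified illustration in NumPy, not the paper's implementation: report tokens act as queries and image patch embeddings as keys/values, so each generated token aggregates the visual evidence most relevant to it. The function name, shapes, and the omission of learned projection matrices are all assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, visual_feats):
    """Text tokens (queries) attend over visual patches (keys = values).

    text_feats:   (T, d) embeddings of T report tokens
    visual_feats: (V, d) embeddings of V image patches
    Returns:      (T, d) visually grounded text features
    (Simplified: learned Q/K/V projections are omitted.)
    """
    d = visual_feats.shape[-1]
    scores = text_feats @ visual_feats.T / np.sqrt(d)  # (T, V) similarity
    weights = softmax(scores, axis=-1)                 # attention over patches per token
    return weights @ visual_feats                      # weighted sum of patch features

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 64))     # 5 report tokens
visual = rng.standard_normal((49, 64))  # e.g. a 7x7 grid of image patches
out = cross_attention(text, visual)
print(out.shape)  # (5, 64)
```

In the full framework this attended representation would then be fused with the other pathways (skip connections, LSTM, MIX-MLP) before the diffusion-based decoder produces the report text.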