Medical Report Generation via Multimodal Spatio-Temporal Fusion

Published: 20 Jul 2024, Last Modified: 21 Jul 2024MM2024 PosterEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Medical report generation aims at automating the synthesis of accurate and comprehensive diagnostic reports from radiological images. The task can significantly enhance clinical decision-making and alleviate the workload on radiologists. Existing works normally generate reports from single chest radiographs, although historical examination data also serve as crucial references for radiologists in real-world clinical settings. To address this constraint, we introduce a novel framework that mimics the workflow of radiologists. This framework compares past and present patient images to monitor disease progression and incorporates prior diagnostic reports as references for generating current personalized reports. We tackle the textual diversity challenge in cross-modal tasks by promoting style-agnostic discrete report representation learning and token generation. Furthermore, we propose a novel spatio-temporal fusion method with multi-granularities to fuse textual and visual features by disentangling the differences between current and historical data. We also tackle token generation biases, which arise from long-tail frequency distributions, proposing a novel feature normalization technique. This technique ensures unbiased generation for tokens, whether they are frequent or infrequent, enabling the robustness of report generation for rare diseases. Experimental results on the two public datasets demonstrate that our proposed model outperforms state-of-the-art baselines.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our work marks an advancement in multimedia and multimodal processing by enhancing the synthesis of diagnostic reports from radiological images. It innovatively addresses the fusion of textual and visual data, crucial for accurately reflecting disease progression and patient history in medical reports. This work exemplifies the impactful application of multimodal processing in healthcare, offering a pathway to more personalized, accurate, and efficient medical diagnostics and patient care.
Submission Number: 3647
Loading