Abstract: Radiology report generation requires precise alignment between medical imaging findings and clinically coherent textual descriptions.
While current methods predominantly rely on either large vision-language models (LVLMs) for visual grounding or large language models (LLMs) for medical narrative generation, they often fail to effectively integrate multimodal clinical evidence with domain-specific knowledge.
This paper proposes a novel multimodal dual-path framework that synergistically combines LVLMs and LLMs to address these limitations.
Our approach dynamically fuses the visual-semantic grounding capabilities of LVLMs with the clinical knowledge reasoning of LLMs.
Specifically, we employ a structured prompting strategy that decomposes the report generation task into three clinically meaningful sections and introduces fine-grained multi-label classification prompts to guide the models, enabling more accurate and comprehensive clinical report generation.
Experiments on the public MIMIC-CXR benchmark demonstrate our framework's superiority over state-of-the-art methods.
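Since the abstract does not show the paper's actual prompts, the following is a minimal sketch of how such a structured prompting pipeline could look: an LVLM-side multi-label classification prompt whose predictions condition section-wise LLM generation prompts. The section names, the CheXpert-style label set, and all function names here are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch of a dual-path structured prompting pipeline.
# Assumptions (not from the paper): a three-section report split and a
# CheXpert-style finding label set; both are illustrative placeholders.

SECTIONS = ["Indication", "Findings", "Impression"]  # assumed section split

# Assumed fine-grained label set; the paper's actual labels may differ.
LABELS = ["Atelectasis", "Cardiomegaly", "Edema", "Pleural Effusion", "Pneumonia"]

def build_classification_prompt(labels: list[str]) -> str:
    """Multi-label classification prompt for the LVLM (visual grounding) path."""
    return (
        "Given the chest X-ray, answer yes/no for each finding:\n"
        + "\n".join(f"- {label}?" for label in labels)
    )

def build_section_prompt(section: str, predicted_labels: list[str]) -> str:
    """Section-level generation prompt for the LLM (clinical reasoning) path,
    conditioned on the LVLM's predicted labels (the fusion point)."""
    label_str = ", ".join(predicted_labels) if predicted_labels else "no acute findings"
    return (
        f"Write the {section} section of a radiology report.\n"
        f"Visual evidence (from the vision model): {label_str}.\n"
        f"Use precise, clinically coherent language."
    )

if __name__ == "__main__":
    # The LVLM path would answer the classification prompt; its output is
    # stubbed here for illustration.
    predicted = ["Cardiomegaly", "Pleural Effusion"]
    print(build_classification_prompt(LABELS))
    for section in SECTIONS:
        print("\n---\n" + build_section_prompt(section, predicted))
```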
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Radiology report generation, Multimodal inference, Dual-path decoding
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2591