Keywords: Radiology report generation, Multimodal inference, Dual-path decoding
Abstract: Radiology report generation requires precise alignment between medical imaging findings and clinically coherent textual descriptions. While current methods predominantly rely on either large vision-language models (LVLMs) for visual grounding or large language models (LLMs) for medical narrative generation, they often fail to effectively integrate multimodal clinical evidence with domain-specific knowledge. This paper proposes a novel multimodal dual-path framework that synergistically combines LVLMs and LLMs to address these limitations. Our approach establishes a dynamic fusion between the visual-semantic grounding capabilities of LVLMs and the clinical knowledge reasoning of LLMs. Specifically, we employ a structured prompting strategy that decomposes the report generation task into three clinically meaningful sections and introduces fine-grained multi-label classification prompts to guide the models, enabling more accurate and comprehensive clinical report generation. Experiments on the public MIMIC-CXR and IU-Xray benchmarks demonstrate our framework's superiority over state-of-the-art methods.
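The structured prompting strategy described in the abstract could be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the section names (`Findings`, `Impression`, `Recommendations`), the label set, and both helper functions are assumptions chosen to show the two prompt stages (fine-grained multi-label classification, then section-structured generation).

```python
# Hypothetical sketch of the dual-path prompting idea; section names and
# the finding-label vocabulary below are illustrative assumptions.

FINDING_LABELS = ["Cardiomegaly", "Pleural Effusion", "Pneumonia", "No Finding"]


def build_classification_prompt(labels):
    """Fine-grained multi-label classification prompt for the LVLM path."""
    options = ", ".join(labels)
    return (
        "Examine the chest X-ray and list every finding that applies "
        f"from: {options}. Answer with a comma-separated subset."
    )


def build_report_prompt(predicted_labels,
                        sections=("Findings", "Impression", "Recommendations")):
    """Generation prompt guiding the LLM path through three report sections,
    conditioned on the labels predicted by the classification stage."""
    label_str = ", ".join(predicted_labels) or "No Finding"
    section_str = "\n".join(f"{i + 1}. {s}:" for i, s in enumerate(sections))
    return (
        f"Visual evidence (multi-label classifier output): {label_str}.\n"
        "Write a radiology report with exactly these sections:\n"
        f"{section_str}"
    )


cls_prompt = build_classification_prompt(FINDING_LABELS)
report_prompt = build_report_prompt(["Cardiomegaly", "Pleural Effusion"])
print(report_prompt)
```

In this sketch, the LVLM's label predictions act as the "multimodal clinical evidence" passed to the LLM, while the fixed section list enforces the clinically meaningful report structure.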
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: NLP Applications, Multimodality and Language Grounding to Vision, Robotics and Beyond
Languages Studied: English
Submission Number: 3110