Keywords: Radiology report generation, Multimodal inference, Dual-path decoding
Abstract: Radiology report generation requires precise alignment between medical imaging findings and clinically coherent textual descriptions. While current methods predominantly rely on either large vision-language models (LVLMs) for visual grounding or large language models (LLMs) for medical narrative generation, they often fail to effectively integrate multimodal clinical evidence with domain-specific knowledge. This paper proposes a novel multimodal dual-path framework that synergistically combines LVLMs and LLMs to address these limitations. Our approach establishes a dynamic fusion between the visual-semantic grounding capabilities of LVLMs and the clinical knowledge reasoning of LLMs. Specifically, we employ a structured prompting strategy that decomposes the report generation task into three clinically meaningful sections and introduces fine-grained multi-label classification prompts to guide the models, enabling more accurate and comprehensive clinical report generation. Experiments on the public MIMIC-CXR and IU-Xray benchmarks demonstrate our framework's superiority over state-of-the-art methods.
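The structured prompting strategy described in the abstract could be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the section names (`Findings`, `Impression`, `Recommendations`), the label set, and both helper functions are assumptions chosen to show the two prompt stages (fine-grained multi-label classification, then section-structured generation).

```python
# Hypothetical sketch of the dual-path prompting idea; section names and
# the finding-label vocabulary below are illustrative assumptions.

FINDING_LABELS = ["Cardiomegaly", "Pleural Effusion", "Pneumonia", "No Finding"]


def build_classification_prompt(labels):
    """Fine-grained multi-label classification prompt for the LVLM path."""
    options = ", ".join(labels)
    return (
        "Examine the chest X-ray and list every finding that applies "
        f"from: {options}. Answer with a comma-separated subset."
    )


def build_report_prompt(predicted_labels,
                        sections=("Findings", "Impression", "Recommendations")):
    """Generation prompt guiding the LLM path through three report sections,
    conditioned on the labels predicted by the classification stage."""
    label_str = ", ".join(predicted_labels) or "No Finding"
    section_str = "\n".join(f"{i + 1}. {s}:" for i, s in enumerate(sections))
    return (
        f"Visual evidence (multi-label classifier output): {label_str}.\n"
        "Write a radiology report with exactly these sections:\n"
        f"{section_str}"
    )


cls_prompt = build_classification_prompt(FINDING_LABELS)
report_prompt = build_report_prompt(["Cardiomegaly", "Pleural Effusion"])
print(report_prompt)
```

In this sketch, the LVLM's label predictions act as the "multimodal clinical evidence" passed to the LLM, while the fixed section list enforces the clinically meaningful report structure.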
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: NLP Applications, Multimodality and Language Grounding to Vision, Robotics and Beyond
Languages Studied: English
Submission Number: 3110