Abstract: Radiology report generation requires precise alignment between medical imaging findings and clinically coherent textual descriptions.
While current methods predominantly rely on either large vision-language models (LVLMs) for visual grounding or large language models (LLMs) for medical narrative generation, they often fail to effectively integrate multimodal clinical evidence with domain-specific knowledge.
This paper proposes a novel multimodal dual-path framework that synergistically combines LVLMs and LLMs to address these limitations.
Our approach dynamically fuses the visual-semantic grounding capabilities of LVLMs with the clinical knowledge reasoning of LLMs.
Specifically, we employ a structured prompting strategy that decomposes the report generation task into three clinically meaningful sections and introduces fine-grained multi-label classification prompts to guide the models, enabling more accurate and comprehensive clinical report generation.
Experiments on the public MIMIC-CXR benchmark demonstrate our framework's superiority over state-of-the-art methods.
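Since the abstract does not show the paper's actual prompts, the following is a minimal sketch of how such a structured prompting pipeline could look: an LVLM-side multi-label classification prompt whose predictions condition section-wise LLM generation prompts. The section names, the CheXpert-style label set, and all function names here are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch of a dual-path structured prompting pipeline.
# Assumptions (not from the paper): a three-section report split and a
# CheXpert-style finding label set; both are illustrative placeholders.

SECTIONS = ["Indication", "Findings", "Impression"]  # assumed section split

# Assumed fine-grained label set; the paper's actual labels may differ.
LABELS = ["Atelectasis", "Cardiomegaly", "Edema", "Pleural Effusion", "Pneumonia"]

def build_classification_prompt(labels: list[str]) -> str:
    """Multi-label classification prompt for the LVLM (visual grounding) path."""
    return (
        "Given the chest X-ray, answer yes/no for each finding:\n"
        + "\n".join(f"- {label}?" for label in labels)
    )

def build_section_prompt(section: str, predicted_labels: list[str]) -> str:
    """Section-level generation prompt for the LLM (clinical reasoning) path,
    conditioned on the LVLM's predicted labels (the fusion point)."""
    label_str = ", ".join(predicted_labels) if predicted_labels else "no acute findings"
    return (
        f"Write the {section} section of a radiology report.\n"
        f"Visual evidence (from the vision model): {label_str}.\n"
        f"Use precise, clinically coherent language."
    )

if __name__ == "__main__":
    # The LVLM path would answer the classification prompt; its output is
    # stubbed here for illustration.
    predicted = ["Cardiomegaly", "Pleural Effusion"]
    print(build_classification_prompt(LABELS))
    for section in SECTIONS:
        print("\n---\n" + build_section_prompt(section, predicted))
```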
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Radiology report generation, Multimodal inference, Dual-path decoding
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2591