Optimizing Efficiency and Visual-Textual Alignment for LLM-Based Radiology Report Generation

Published: 2025, Last Modified: 19 Jan 2026. Venue: ICME 2025. License: CC BY-SA 4.0
Abstract: LLM-based radiology report generation (R2Gen) systems have demonstrated promising performance but face significant challenges in bridging the gap between the visual encoder and the LLM. Specifically, two issues hinder progress: (1) a parameter-heavy visual projector that increases complexity and degrades performance, and (2) insufficient alignment between the visual and textual modalities, which limits system efficacy. To address these, we propose R2Gen-EVA, a novel framework emphasizing Efficiency and Visual-Textual Alignment (VTA), which introduces two key innovations: (1) a parameter-free visual projector that enhances model efficiency while improving performance, and (2) an LLM-adapted VTA module that enhances the alignment of visual features with the LLM’s textual embeddings. Our design significantly improves model efficacy without adding extra parameters, achieving both reduced complexity and higher computational efficiency during inference. Extensive experiments demonstrate that R2Gen-EVA enhances the fluency and clinical accuracy of generated reports, establishing it as a more effective and efficient solution for LLM-based R2Gen. The code is available at https://github.com/zailongchen/R2Gen-EVA.