Chest X-Ray Report Generation Using Abnormality Guided Vision Language Model

Published: 01 Jan 2025 · Last Modified: 25 Sept 2025 · IEEE Access 2025 · CC BY-SA 4.0
Abstract: Automated radiology report generation is essential for assisting radiologists in managing the growing volume of chest radiographs. Unlike conventional image captioning models, radiology report generation must prioritize abnormality-specific feature extraction over generic image features. However, existing models lack an inherent mechanism to extract clinically significant abnormality features directly from radiographs without relying on external knowledge bases or auxiliary models. Additionally, most existing approaches use a single vision encoder, limiting their ability to capture complementary visual cues essential for accurate abnormality identification. To address these limitations, we introduce META-CXR (Multimodal Expert Tokens-based VLM for Abnormality-Guided Chest X-ray Reporting), a vision-language model (VLM) designed to enhance both abnormality classification and radiology report generation. META-CXR employs a multi-encoder visual backbone that combines CNNs, ViTs, and Swin Transformers to extract diverse and complementary visual features. These multi-encoder features serve as a shared representation and are utilized in two key components of the architecture: the META-Former module, a modified Q-Former designed to fuse heterogeneous encoder outputs into a unified token space for effective report generation, and the Multi-Head Cross Attention Classification (MHCAC) module, which performs multi-class, multi-label abnormality classification, including an explicit “uncertain” category. The identified abnormalities are then integrated into the language modeling process as soft prompts to ensure clinically coherent and diagnosis-aware report generation using a large language model (LLM). META-CXR achieves state-of-the-art performance, demonstrating an F1-score of 0.699 for cross-domain multi-class, multi-label abnormality classification on the CheXpert dataset. Additionally, it achieves strong natural language generation results, with a BERTScore of 0.426 and a METEOR score of 0.173 on the MIMIC-CXR test set. Ablation studies validate the contributions of the multi-encoder backbone, hierarchical classification via MHCAC, the META-Former fusion strategy, and classification-aware report generation. Furthermore, attention map visualizations improve interpretability by highlighting clinically relevant image regions, offering radiologists transparent insights into model decisions. By addressing key challenges in abnormality-specific feature extraction, uncertainty-aware classification, and diagnosis-driven reporting, META-CXR establishes a new benchmark for vision-language models in radiology AI.
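To make the described pipeline concrete, the sketch below illustrates the overall flow in PyTorch: features from multiple vision encoders are concatenated, fused by learnable query tokens via cross-attention (a Q-Former-style stand-in for META-Former), and passed to a multi-label head with an explicit "uncertain" state per finding (a stand-in for MHCAC), whose predictions would then be mapped to soft prompts for the LLM. All module names, dimensions, and fusion details are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the abstract's described flow, assuming hypothetical
# module names and dimensions; not the paper's actual code.
import torch
import torch.nn as nn

class QueryFusion(nn.Module):
    """Q-Former-style fusion: learnable query tokens cross-attend to
    concatenated multi-encoder features (stand-in for META-Former)."""
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens):             # (B, N, dim)
        q = self.queries.expand(visual_tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return fused                               # (B, num_queries, dim)

class AbnormalityClassifier(nn.Module):
    """Multi-label head with three states per finding
    (absent / present / uncertain), a stand-in for MHCAC."""
    def __init__(self, dim=768, num_findings=14):
        super().__init__()
        self.num_findings = num_findings
        self.head = nn.Linear(dim, num_findings * 3)

    def forward(self, fused):                      # (B, Q, dim)
        pooled = fused.mean(dim=1)
        return self.head(pooled).view(-1, self.num_findings, 3)

# Toy forward pass: token sequences from three encoders are concatenated,
# fused, and classified; the predicted labels would then be embedded as
# soft prompts prepended to the LLM input (not shown here).
cnn_feats  = torch.randn(2, 49, 768)   # e.g. CNN grid features
vit_feats  = torch.randn(2, 196, 768)  # e.g. ViT patch tokens
swin_feats = torch.randn(2, 49, 768)   # e.g. Swin window tokens
visual_tokens = torch.cat([cnn_feats, vit_feats, swin_feats], dim=1)

fusion = QueryFusion()
classifier = AbnormalityClassifier()
fused = fusion(visual_tokens)
logits = classifier(fused)             # (2, 14, 3): absent / present / uncertain
```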