VDGen: Visual and Diagnostic Representation Calibration for Radiology Report Generation

ACL ARR 2025 May Submission5309 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Automated radiology report generation (RRG) offers the potential to reduce clinical workload and enhance diagnostic consistency. However, existing models struggle with degraded visual representations caused by long-tailed lesion distributions and exhibit limited alignment between image features and diagnostic semantics. We propose \textbf{VDGen}, a unified framework that calibrates visual and diagnostic representations to improve disease-aware report generation. VDGen integrates two complementary modules: \textbf{Vision Self-Equilibration (VSE)}, a self-supervised contrastive module that mitigates visual feature degradation by promoting structured representation learning; and \textbf{Disease Information Distillation (DID)}, a cross-modal distillation mechanism that uses diagnostic reports as teacher signals to guide the extraction of disease-sensitive semantics from visual features. Our end-to-end architecture incorporates a LoRA-adapted large language model (LLM) decoder to generate clinically accurate reports. Experiments on the IU-Xray and MIMIC-CXR datasets show that VDGen achieves state-of-the-art performance on MIMIC-CXR and maintains competitive results on IU-Xray. Code and models will be released upon acceptance.
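The abstract names two training objectives: a self-supervised contrastive loss for the visual branch (VSE) and a teacher-guided distillation loss from report semantics (DID). The sketch below illustrates generic versions of these two loss families in NumPy — a symmetric InfoNCE-style contrastive term and a temperature-scaled KL-divergence distillation term. Both are stand-ins under assumption: the paper's exact loss formulations, temperatures, and feature extractors are not specified in this abstract, and all function names here are hypothetical.

```python
import numpy as np

def info_nce(features_a, features_b, temperature=0.1):
    """Generic InfoNCE contrastive loss over two views of visual features.

    Illustrative stand-in for a VSE-style objective: matched rows of
    features_a / features_b are positives; all other rows are negatives.
    """
    # L2-normalize so logits are cosine similarities scaled by temperature
    a = features_a / np.linalg.norm(features_a, axis=1, keepdims=True)
    b = features_b / np.linalg.norm(features_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Log-softmax over each row; positives sit on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return float(-log_probs[idx, idx].mean())

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL distillation loss.

    Illustrative stand-in for a DID-style objective: soft targets from a
    report-derived teacher guide the visual student's disease predictions.
    """
    def softmax(x):
        z = x / temperature
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=1)
    return float(kl.mean() * temperature ** 2)
```

In a joint setup the two terms would typically be summed with the report-generation loss, e.g. `total = lm_loss + w1 * info_nce(v1, v2) + w2 * kd_loss(s, t)` with tunable weights; the actual weighting used by VDGen is not stated in the abstract.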
Paper Type: Long
Research Area: Generation
Research Area Keywords: clinical NLP, cross-modal content generation, healthcare applications, multimodal applications
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5309