Abstract: Automated radiology report generation (RRG) offers the potential to reduce clinical workload and enhance diagnostic consistency. However, existing models struggle with degraded visual representations caused by long-tailed lesion distributions and with weak alignment between image features and diagnostic semantics.
We propose \textbf{VDGen}, a unified framework for calibrating visual and diagnostic representations to improve disease-aware report generation. VDGen integrates two complementary modules: \textbf{Vision Self-Equilibration (VSE)}, a self-supervised contrastive module that mitigates visual feature degradation by promoting structured representation learning; and \textbf{Disease Information Distillation (DID)}, a cross-modal distillation mechanism that uses diagnostic reports as teacher signals to guide the extraction of disease-sensitive semantics from visual features.
Our end-to-end architecture incorporates a LoRA-adapted large language model (LLM) decoder to generate clinically accurate reports. Experiments on the IU-Xray and MIMIC-CXR datasets show that VDGen achieves state-of-the-art performance on MIMIC-CXR and maintains competitive results on IU-Xray. Code and models will be released upon acceptance.
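To make the described setup concrete, the following is a minimal sketch of how the abstract's three objectives could be combined during training. The specific formulations are assumptions for illustration, not the paper's exact implementation: an InfoNCE-style contrastive loss standing in for VSE, a KL-based distillation loss from report-derived teacher features standing in for DID, and a token-level language-modeling loss supplied by the LoRA-adapted LLM decoder (not shown); the function names and loss weights `w_vse` / `w_did` are likewise hypothetical.

```python
# Hypothetical sketch of a VDGen-style combined objective (assumed formulation, not the authors' code).
import torch
import torch.nn.functional as F

def vse_contrastive_loss(z1, z2, temperature=0.07):
    """Symmetric InfoNCE between two views of the visual features (assumed VSE objective)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature              # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def did_distillation_loss(student_visual, teacher_text, temperature=2.0):
    """KL divergence pulling visual (student) features toward report-derived (teacher) features."""
    p_teacher = F.softmax(teacher_text / temperature, dim=-1)
    log_p_student = F.log_softmax(student_visual / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def total_loss(lm_loss, z1, z2, student_visual, teacher_text, w_vse=0.5, w_did=0.5):
    """Overall objective: generation loss plus weighted VSE and DID terms (weights assumed)."""
    return (lm_loss
            + w_vse * vse_contrastive_loss(z1, z2)
            + w_did * did_distillation_loss(student_visual, teacher_text))
```

In this reading, `lm_loss` comes from the LoRA-adapted LLM decoder, `z1`/`z2` are two views produced by the vision encoder, and `student_visual`/`teacher_text` are the visual features and the report-encoder features used as teacher signals.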
Paper Type: Long
Research Area: Generation
Research Area Keywords: clinical NLP, cross-modal content generation, healthcare applications, multimodal applications
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5309