C2RG: Parameter-efficient Adaptation of 3D Vision and Language Foundation Model for Coronary CTA Report Generation

Zhiyu Ye, Yue Sun, Wei Shi, Bang Yang, Shibin Wu, Hancong Wang, Cheng Xu, Hairong Zheng, Yining Wang, Tong Zhang

Published: 2024, Last Modified: 07 Apr 2025, BIBM 2024, License: CC BY-SA 4.0

Abstract: Medical report generation (MRG) is a challenging yet much-needed application of multi-modal artificial intelligence in medicine. Training an MRG model typically requires datasets of tens of thousands of labelled radiology images and reports, which can be impractical to collect for most clinical research groups. In this study, we present C2RG, a novel 3D vision and language foundation model tailored for Coronary Computed Tomography Angiography (CTA) Report Generation. Inspired by the BLIP-2 architecture, our method connects a self-supervised pre-trained 3D cardiac vision model (ViT-B) to a general-purpose bilingual foundation model (ChatGLM-6B) through a lightweight Querying Transformer (Q-Former). We also introduce a parallel high-resolution feature extractor module, which encodes fine-grained 3D features, and a coronary calcification evaluation loss, which constrains the accuracy of the generated reports. We compared our model with six state-of-the-art MRG methods on a clinical dataset of 118 subjects, comprising 453 paired 3D CTA images and radiology reports. Experimental results, including extensive ablations, demonstrate the efficacy of C2RG. Code will be open-sourced after the conference.
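To make the BLIP-2-style bridge more concrete, the sketch below shows the core idea of a Q-Former as described for BLIP-2: a fixed set of learned query vectors cross-attends to a variable number of visual tokens (here standing in for 3D ViT patch tokens), and the result is linearly projected into the language model's embedding space, so the LLM always receives the same number of visual "prompt" tokens. This is a minimal, hypothetical illustration in pure Python and not the C2RG implementation: it uses a single cross-attention head with random weights and tiny dimensions, whereas the real Q-Former is a multi-layer Transformer that also applies self-attention among the queries. All names (`query_cross_attention`, `bridge`, `init_params`) and sizes are assumptions for illustration only.

```python
import math
import random


def matmul(a, b):
    # Plain matrix product: a is n x k, b is k x m, result is n x m.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    mx = max(row)
    exps = [math.exp(x - mx) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    # Hypothetical small random init; a trained model would learn these weights.
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def query_cross_attention(queries, visual_tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product cross-attention:
    # the learned queries attend over all visual tokens.
    q = matmul(queries, w_q)
    k = matmul(visual_tokens, w_k)
    v = matmul(visual_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # n_queries x n_visual_tokens
    attn = [softmax([s * scale for s in row]) for row in scores]
    return matmul(attn, v)  # n_queries x d_vis


def bridge(visual_tokens, params):
    # Compress any number of visual tokens into n_queries vectors,
    # then project them to the (hypothetical) LLM embedding width.
    pooled = query_cross_attention(params["queries"], visual_tokens,
                                   params["w_q"], params["w_k"], params["w_v"])
    return matmul(pooled, params["w_proj"])  # n_queries x d_llm


def init_params(n_queries, d_vis, d_llm, seed=0):
    rng = random.Random(seed)
    return {
        "queries": rand_matrix(n_queries, d_vis, rng),
        "w_q": rand_matrix(d_vis, d_vis, rng),
        "w_k": rand_matrix(d_vis, d_vis, rng),
        "w_v": rand_matrix(d_vis, d_vis, rng),
        "w_proj": rand_matrix(d_vis, d_llm, rng),
    }


if __name__ == "__main__":
    params = init_params(n_queries=4, d_vis=8, d_llm=16)
    rng = random.Random(1)
    for n_tokens in (10, 50):  # e.g. volumes yielding different token counts
        out = bridge(rand_matrix(n_tokens, 8, rng), params)
        print(n_tokens, "->", len(out), "x", len(out[0]))
```

Whatever the number of input tokens, the output shape stays fixed (here 4 x 16), which is what allows a frozen or lightly tuned LLM to consume 3D image features as a short prefix. In BLIP-2 the image encoder and LLM are kept frozen and only the Q-Former and projection are trained; the abstract does not state exactly which components C2RG freezes.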