Abstract: While 3D visual self-supervised learning (vSSL) has demonstrated strong performance in capturing visual features, it overlooks the integration of clinical knowledge from radiology reports. Furthermore, 3D medical vision-language pre-training (MedVLP) has been hindered by the lack of large-scale, publicly available 3D medical image-report datasets. To address this gap, we introduce **CT-3DVLP**, the first and largest **public** 3D volume-report dataset, providing a new benchmark for 3D MedVLP research. Additionally, we propose the **T3D** framework, which surpasses naive CLIP-style alignment by incorporating **Text-informed Multi-view Alignment (TMA)**, a novel method that clusters volumetric data while ensuring consistency across different views of the same volume-report pair. This method integrates textual features into fine-grained visual representations, enhancing contextual coherence. We evaluate T3D across various tasks, including zero-shot and fine-tuned classification, cross-modal retrieval, report generation, and semantic segmentation, and show that it outperforms existing vSSL and multimodal methods, setting a new standard for 3D medical image understanding.
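The abstract describes TMA as fusing report text into per-view volume features and enforcing consistency across augmented views of the same volume-report pair. Below is a minimal, hedged sketch of that general idea; the module name, tensor shapes, cross-attention fusion, and the cosine-consistency loss are all illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of text-informed multi-view consistency.
# All names, shapes, and the specific loss are assumptions for illustration,
# not the T3D/TMA implementation described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextInformedMultiViewAlignment(nn.Module):
    """Fuses report (text) tokens into per-view volume tokens via
    cross-attention, then encourages the two fused views to agree."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def fuse(self, vol_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Volume tokens attend to report tokens (text-informed refinement).
        fused, _ = self.cross_attn(vol_tokens, text_tokens, text_tokens)
        return self.proj(fused + vol_tokens).mean(dim=1)  # pooled view embedding

    def forward(self, view1, view2, text_tokens):
        z1 = F.normalize(self.fuse(view1, text_tokens), dim=-1)
        z2 = F.normalize(self.fuse(view2, text_tokens), dim=-1)
        # Consistency loss: two views of the same volume-report pair should
        # map to similar embeddings (negative cosine similarity here).
        return -(z1 * z2).sum(dim=-1).mean()


# Toy usage with random tensors standing in for encoder outputs.
B, Nv, Nt, D = 2, 64, 32, 256          # batch, volume tokens, text tokens, dim
tma = TextInformedMultiViewAlignment(dim=D)
view1 = torch.randn(B, Nv, D)          # features of augmented view 1 of a CT volume
view2 = torch.randn(B, Nv, D)          # features of augmented view 2
text = torch.randn(B, Nt, D)           # features of the paired radiology report
loss = tma(view1, view2, text)
print(loss.item())
```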
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodal learning in healthcare
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4222