T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency

ACL ARR 2025 February Submission 4222 Authors

15 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: While 3D visual self-supervised learning (vSSL) has demonstrated strong performance in capturing visual features, it overlooks the clinical knowledge contained in radiology reports. Furthermore, progress in 3D medical vision-language pre-training (MedVLP) has been hindered by the lack of large-scale, publicly available 3D medical image-report datasets. To address this gap, we introduce **CT-3DVLP**, the first and largest **public** 3D volume-report dataset, establishing a new benchmark for 3D MedVLP research. We further propose the **T3D** framework, which goes beyond naive CLIP-style alignment by incorporating **Text-informed Multi-view Alignment (TMA)**, a novel objective that clusters volumetric data while enforcing consistency across different views of the same volume-report pair. TMA integrates textual features into fine-grained visual representations, enhancing contextual coherence. We evaluate T3D on zero-shot and fine-tuned classification, cross-modal retrieval, report generation, and semantic segmentation, and show that it outperforms existing vSSL and multimodal methods, setting a new standard for 3D medical image understanding.
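The abstract does not spell out the TMA objective, so the snippet below is only an illustrative sketch, not the paper's actual formulation: it pairs a cross-view consistency term with per-view volume-report contrastive alignment. The function name `tma_style_loss`, the symmetric InfoNCE form, the temperature value, and the equal weighting of the two terms are all assumptions.

```python
import torch
import torch.nn.functional as F

def tma_style_loss(view1_feats, view2_feats, text_feats, temperature=0.07):
    """Illustrative sketch of a text-informed multi-view alignment loss.

    view1_feats, view2_feats: (B, D) embeddings of two views of the same 3D volume.
    text_feats:               (B, D) embeddings of the paired radiology reports.
    Assumed formulation for illustration; not the objective defined in the paper.
    """
    v1 = F.normalize(view1_feats, dim=-1)
    v2 = F.normalize(view2_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    targets = torch.arange(v1.size(0), device=v1.device)

    # Cross-view consistency: two views of the same volume should agree.
    logits_vv = v1 @ v2.t() / temperature
    loss_view = (F.cross_entropy(logits_vv, targets) +
                 F.cross_entropy(logits_vv.t(), targets)) / 2

    # Text-informed alignment: each view should match its own report.
    loss_text = 0.0
    for v in (v1, v2):
        logits_vt = v @ t.t() / temperature
        loss_text = loss_text + (F.cross_entropy(logits_vt, targets) +
                                 F.cross_entropy(logits_vt.t(), targets)) / 2
    loss_text = loss_text / 2

    return loss_view + loss_text
```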
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodal learning in healthcare
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4222