Bridging the Dimensionality Gap: A Scoping Review of 3D Vision-Language Models in Medical Imaging

ACL ARR 2026 January Submission 1247 Authors

29 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: 3D Vision-Language Model, Volumetric Medical Imaging, Multimodal Fusion, Medical Foundation Models, Radiology Report Generation, Visual Question Answering, Parameter-Efficient Fine-Tuning, Contrastive Learning, Survey, Scoping Review
Abstract: While large multimodal models have achieved remarkable success in general domains, adapting them to medical imaging faces a fundamental dimensionality gap: the transition from 2D snapshots to 3D volumetric data (e.g., CT, MRI). This review systematically examines 37 studies on 3D Vision-Language Models in healthcare, capturing the rapid research surge of 2024-2025 in this emerging field. We further propose a categorical framework that classifies these studies along three axes: multimodal fusion mechanism (projection, attention, and adapters), volumetric encoding strategy (slice-based vs. native 3D), and language processing (encoder vs. foundation model). Our analysis highlights the growing adoption of parameter-efficient fine-tuning driven by computational constraints, but significant challenges remain, including hallucinations, a lack of spatial grounding, and misalignment between evaluation metrics and clinical utility. This survey aims to clarify current methodologies and identify future directions in volumetric medical AI.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, NLP Applications
Contribution Types: Surveys
Languages Studied: English
Submission Number: 1247