Abstract: Vision-Language Models (VLMs) have achieved remarkable success but exhibit a fundamental deficiency in spatial intelligence, a critical capability for progress in embodied AI, autonomous driving, and spatially coherent generation. In response, the research community has produced an explosion of work dedicated to enhancing these models, but this rapid progress has left a fragmented landscape that lacks a unified framework. This paper presents the first comprehensive survey to address this gap, providing a systematic review that spans the foundations of spatial intelligence in VLMs, the root causes of their spatial limitations, enhancement methodologies, evaluation protocols, and real-world applications. Specifically, we introduce a novel, intervention-based taxonomy that categorizes enhancement methodologies according to where spatial information is incorporated: (1) training-free prompting, (2) model-centric enhancements (training strategies, architectural modules, encoder improvements), (3) explicit 2D information injection, (4) 3D spatial enrichment, and (5) data-centric approaches. To further assess the true capabilities of current models, we conduct a rigorous empirical study evaluating 37 models across 9 representative benchmarks. Our results and analysis establish the current state of the art, identify the strengths and weaknesses of different methods, and uncover critical limitations in existing evaluation protocols. By structuring this rapidly evolving field and establishing a clear research agenda, this survey serves as an indispensable resource for advancing the next generation of spatially intelligent AI systems.
External IDs: doi:10.36227/techrxiv.176231405.57942913/v2