Keywords: vlms, spatial, image encoders, robotics
TL;DR: Improving spatial reasoning in VLMs through alternative image-encoder objectives and 2D positional encodings
Abstract: Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a critical blind spot. Current VLMs are typically built with contrastive language-image pretraining (CLIP)-style image encoders, whose training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a key bottleneck for applications requiring strong multimodal grounding, such as robotics and embodied AI. To address this, we investigate two overlooked components: (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our work shows that these architectural choices lead to models with superior spatial reasoning, highlighting a key but underexplored design space for grounded AI. Code for this work will be released soon.
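To make the abstract's contrast concrete, the sketch below (not the authors' implementation, which has not yet been released) shows one common form of 2D positional encoding: factorized sin-cos embeddings over a patch grid, compared with the usual 1D embeddings over a flattened patch sequence. The grid size, embedding dimension, and function names are illustrative assumptions.

```python
# Illustrative sketch only: factorized 2D sin-cos positional embeddings for a
# grid of image patches, versus 1D embeddings over the flattened sequence.
import numpy as np

def sincos_1d(positions: np.ndarray, dim: int) -> np.ndarray:
    """Standard 1D sin-cos embedding for a vector of integer positions."""
    assert dim % 2 == 0
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))      # (dim/2,)
    angles = np.outer(positions, freqs)                              # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (N, dim)

def sincos_2d(grid_h: int, grid_w: int, dim: int) -> np.ndarray:
    """Factorized 2D embedding: half the channels encode the row, half the column."""
    assert dim % 2 == 0
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_row = sincos_1d(rows.reshape(-1), dim // 2)    # (H*W, dim/2)
    emb_col = sincos_1d(cols.reshape(-1), dim // 2)    # (H*W, dim/2)
    return np.concatenate([emb_row, emb_col], axis=1)  # (H*W, dim)

# 14x14 patch grid (e.g., a 224px image with 16px patches), 768-d embeddings.
pos_1d = sincos_1d(np.arange(14 * 14), 768)  # flattened: loses which patches share a column
pos_2d = sincos_2d(14, 14, 768)              # 2D: row/column structure is explicit
print(pos_1d.shape, pos_2d.shape)            # (196, 768) (196, 768)
```

In the 1D case, patch 15 is only "one step after" patch 14; in the 2D case, its embedding also encodes that it sits directly below patch 1, which is the kind of structure the abstract argues is needed for spatial reasoning.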
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9665