SUBench: Benchmarking Spatial Understanding in Vision-Language Models

09 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision-Language Models, Spatial Understanding
Abstract: Recent advances in vision-language models (VLMs) have shown remarkable success in general text-image retrieval. However, their ability to understand spatial relationships within images remains under-evaluated. To address this gap, we introduce **SUBench**, a large-scale dataset of more than 50K text-image pairs meticulously designed to evaluate a wide range of spatial relationships in real-world images. To curate the dataset, we design an LLM-based framework that aligns subjective human descriptions with objective spatial relationships. In addition, unlike existing benchmarks, SUBench features a principled taxonomy of spatial concepts that ensures clarity and reduces ambiguity, alongside a scalable pipeline that systematically generates challenging hard negatives. Our experiments show that even state-of-the-art CLIP models struggle significantly on SUBench, revealing a critical blind spot in the spatial understanding of modern VLMs. Furthermore, we use the same approach to curate a training set and show that fine-tuning on it not only improves performance significantly on SUBench but also enhances results on existing evaluation benchmarks. We will release the benchmark and believe SUBench will serve as a valuable resource for developing more spatially aware VLMs.
Primary Area: datasets and benchmarks
Submission Number: 3331
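
The abstract describes a retrieval-style evaluation in which a model must prefer an image's true spatial caption over a systematically generated hard negative. Since SUBench is not yet released, the sketch below is hypothetical: the `(image, caption, hard_negative)` triplet layout, the file name, and the `spatial_accuracy` helper are assumptions, and the Hugging Face `transformers` CLIP API stands in for whichever models the paper actually benchmarks.

```python
# Hypothetical sketch of a hard-negative spatial evaluation for a CLIP model.
# SUBench's real data format is not public; the triplet layout is assumed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def spatial_accuracy(triplets):
    """triplets: iterable of (image_path, caption, hard_negative).

    Counts the model as correct when the true spatial caption receives a
    higher image-text similarity than its hard negative.
    """
    correct, total = 0, 0
    for image_path, caption, hard_negative in triplets:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(
            text=[caption, hard_negative],
            images=image,
            return_tensors="pt",
            padding=True,
        )
        with torch.no_grad():
            # logits_per_image has shape (1, 2): similarity of the image
            # to the true caption (index 0) and the hard negative (index 1).
            logits = model(**inputs).logits_per_image
        correct += int(logits.argmax(dim=-1).item() == 0)
        total += 1
    return correct / total

# Usage with a made-up triplet (a minimal spatial hard negative flips
# a single relation word, e.g. "left" -> "right"):
# acc = spatial_accuracy([("kitchen.jpg",
#                          "a mug to the left of the laptop",
#                          "a mug to the right of the laptop")])
```

Chance-level performance on such pairs is 50%, so any accuracy near that mark would indicate the blind spot the abstract reports.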