Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding

Published: 23 Sept 2025 · Last Modified: 19 Nov 2025 · SpaVLE Poster · CC BY 4.0
Keywords: Spatial reasoning, Object localization, Vision-language models (VLMs), Open-vocabulary detection (OVD)
Abstract: Spatial understanding is a critical capability for vision foundation models. While recent advances in large vision models and vision–language models (VLMs) have expanded recognition capabilities, most benchmarks emphasize localization accuracy rather than whether models capture how objects are arranged and related within a scene. This gap is consequential: effective scene understanding requires not only identifying objects but also reasoning about their relative positions, groupings, and depth. In this paper, we present a systematic benchmark for object-centric spatial reasoning in foundation models. Using a controlled synthetic dataset, we evaluate state-of-the-art vision models (e.g., GroundingDINO, Florence-2, OWLv2) and large VLMs (e.g., InternVL, LLaVA, GPT-4o) across three tasks: spatial localization, spatial reasoning, and downstream retrieval. We find a consistent trade-off: detectors such as GroundingDINO and OWLv2 deliver precise boxes with limited relational reasoning, while VLMs like SmolVLM and GPT-4o provide coarse layout cues and fluent captions but struggle with fine-grained spatial context. Our study highlights the gap between localization and true spatial understanding and points to the need for spatially aware foundation models.
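To make the object-centric spatial-reasoning setting concrete, the sketch below shows one common way such ground truth can be derived from axis-aligned bounding boxes so that detector or VLM answers can be scored against it. This is an illustrative assumption, not the paper's released protocol: the `Box` format, the `spatial_relation` helper, and the pixel `margin` tolerance are all hypothetical choices made for the example.

```python
# Illustrative sketch (not the paper's code): derive a coarse pairwise spatial
# relation ("left of", "above", ...) from axis-aligned boxes, the kind of
# ground truth an object-centric benchmark can score model outputs against.
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in pixel (image) coordinates: y grows downward."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def spatial_relation(a: Box, b: Box, margin: float = 10.0) -> str:
    """Coarse relation of `a` with respect to `b`.

    Uses center displacement with an assumed pixel `margin` so near-aligned
    objects are not forced into a left/right or above/below label.
    """
    (ax, ay), (bx, by) = a.center, b.center
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        if dx < -margin:
            return "left of"
        if dx > margin:
            return "right of"
    else:
        if dy < -margin:  # smaller y is higher in image coordinates
            return "above"
        if dy > margin:
            return "below"
    return "overlapping or aligned with"


if __name__ == "__main__":
    cube = Box("red cube", 40, 120, 100, 180)
    ball = Box("blue ball", 220, 110, 280, 170)
    # Example ground-truth statement a benchmark could check answers against:
    # "The red cube is left of the blue ball."
    print(f"The {cube.label} is {spatial_relation(cube, ball)} the {ball.label}.")
```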
Submission Type: Short Research Paper (< 4 Pages)
Submission Number: 36