Keywords: Spatial reasoning, Vision–language models, Large language models, Reasoning models, LLM evaluation, Spatial understanding
Abstract: Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision–language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce **_SpatiaLab_**, a comprehensive benchmark for evaluating VLMs’ spatial reasoning in realistic, unconstrained contexts.
**_SpatiaLab_** comprises 1,400 visual question–answer pairs across six major categories: *Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation,* and *3D Geometry*, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation.
Experiments across diverse state-of-the-art VLMs, spanning open- and closed-source, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10–25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry.
By providing a diverse, real-world evaluation framework, **_SpatiaLab_** exposes critical challenges and opportunities for advancing VLMs’ spatial reasoning and offers a benchmark to guide future research toward robust, human-aligned spatial understanding. We will publicly release **_SpatiaLab_**.
Primary Area: datasets and benchmarks
Submission Number: 11019