Keywords: Spatial Intelligence, Vision-Language Models, Multimodal Reasoning, Benchmark
TL;DR: This work introduces SIRI-Bench, a benchmark of 9,000 samples for evaluating VLMs’ structural spatial intelligence through spatial-grounded reasoning tasks.
Abstract: Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks.
In contrast, although spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, systematic study of their complex spatial reasoning remains underexplored.
To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs’ structural spatial intelligence through spatial-grounded reasoning tasks.
SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene.
The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning.
To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes.
Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning.
We hope that our study will bring researchers’ attention to spatially grounded reasoning and advance VLMs in visual problem-solving.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 1406