Keywords: Spatial Intelligence, Vision-Language Models, Multimodal Reasoning, Benchmark
TL;DR: This work introduces SIRI-Bench, a benchmark of 9,000 samples for evaluating VLMs’ structural spatial intelligence through spatial-grounded reasoning tasks.
Abstract: Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks.
In contrast, although spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, systematic study of their complex spatial reasoning remains underexplored.
To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs’ structural spatial intelligence through spatial-grounded reasoning tasks.
SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene.
The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning.
To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes.
Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning.
We hope that our study will bring researchers’ attention to spatially grounded reasoning and advance VLMs in visual problem-solving.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 1406