Keywords: Multimodal Large Language Models, Spatial Reasoning, Sequential Reasoning, Visual Question Answering
Abstract: Real-world applications of spatial intelligence, such as robotic control, autonomous driving, and automated assembly, often require spatial reasoning across multiple sequential steps, yet the extent to which current Multimodal Large Language Models (MLLMs) possess this capability remains largely unexplored. Based on LEGO construction, a recreational activity that critically relies on multi-step spatial reasoning, we introduce $\textbf{LEGO-Puzzles}$, a benchmark designed to systematically evaluate the spatial reasoning capabilities of MLLMs, from basic spatial understanding to complex multi-step planning.
LEGO-Puzzles contains two task sets. The $\textbf{Elementary}$ set covers $11$ visual question-answering (VQA) tasks with $1,100$ carefully curated samples, testing elementary spatial reasoning skills that are crucial for LEGO assembly. The $\textbf{Planning}$ set directly requires the model to generate a step-by-step plan for assembling a target LEGO structure, where the number of intermediate steps required to complete the task varies from $1$ to $8$.
Our evaluation of 23 state-of-the-art MLLMs shows that even the strongest models struggle with elementary reasoning tasks, falling at least 20\% behind human performance. Planning accuracy also quickly drops to $0\%$ as the number of steps increases, whereas our human participants solve all the tasks perfectly. Furthermore, changing the output format of LEGO-Puzzles tasks from multiple choice to image generation reduces performance to near zero. Only GPT-4o and Gemini-2.0-Flash exhibit a limited ability to follow the image generation instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles reveals critical limitations in current MLLMs' spatial reasoning capabilities and highlights the need for substantial advances.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 1178