GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks

09 Nov 2025 (modified: 18 Nov 2025) · Submitted to SPARTA_AAAI2026 · CC BY 4.0
Keywords: Multimodal, MLLMs, LLMs, Spatial Reasoning, Origami, Multimodal Reasoning
TL;DR: GamiBench is an origami-inspired benchmark that tests MLLMs’ spatial reasoning and 2D-to-3D planning across views, introducing VC and IFSR metrics that expose key limitations.
Abstract: Multimodal large language models (MLLMs) are proficient at perception and instruction-following, but they still struggle with spatial reasoning: the ability to mentally track and manipulate objects across multiple views and over time. Spatial reasoning is a key component of human intelligence, yet most existing benchmarks focus on static images or final outputs, failing to account for the sequential and viewpoint-dependent nature of this skill. To close this gap, we introduce GamiBench, a benchmark designed to evaluate spatial reasoning and 2D-to-3D planning in MLLMs through origami-inspired folding tasks. GamiBench includes 186 regular and 186 impossible 2D crease patterns paired with their corresponding 3D folded shapes, rendered from six distinct viewpoints across three visual question-answering (VQA) tasks: predicting 3D fold configurations, distinguishing valid viewpoints, and detecting impossible patterns. Unlike previous benchmarks that assess only final predictions, GamiBench holistically evaluates the entire reasoning process of a model, measuring cross-view consistency, physical feasibility (via impossible-fold detection), and interpretation of intermediate folding steps. It further introduces two diagnostic metrics, viewpoint consistency (VC) and impossible fold selection rate (IFSR), to measure how well models handle folds of varying complexity. By linking geometric evaluation with sequential reasoning, GamiBench enables a comprehensive evaluation of state-of-the-art MLLMs, revealing significant limitations in their spatial reasoning capabilities and providing a pipeline for advancing geometric understanding in real-world contexts. We will provide the dataset and code upon acceptance.
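The abstract names the two diagnostic metrics but does not give their formulas. The Python sketch below shows one plausible reading, stated as an assumption rather than the authors' implementation: VC as the fraction of items answered correctly under every rendered viewpoint, and IFSR as the fraction of impossible crease patterns a model nonetheless accepts as foldable. All function names and input formats are hypothetical.

```python
from typing import Dict, List


def viewpoint_consistency(answers_by_view: Dict[str, List[str]],
                          gold: List[str]) -> float:
    """Assumed VC: share of items answered correctly under *every* viewpoint.

    answers_by_view maps each viewpoint name to the model's answers,
    one per item, aligned with the gold labels.
    """
    n_items = len(gold)
    if n_items == 0:
        return 0.0
    consistent = sum(
        1 for i in range(n_items)
        if all(answers_by_view[view][i] == gold[i] for view in answers_by_view)
    )
    return consistent / n_items


def impossible_fold_selection_rate(model_says_foldable: List[bool],
                                   is_impossible: List[bool]) -> float:
    """Assumed IFSR: share of impossible crease patterns the model
    judges to be foldable (i.e., fails to flag as impossible)."""
    impossible_idx = [i for i, imp in enumerate(is_impossible) if imp]
    if not impossible_idx:
        return 0.0
    selected = sum(1 for i in impossible_idx if model_says_foldable[i])
    return selected / len(impossible_idx)
```

Under this reading, a higher VC indicates answers that remain stable across the six viewpoints, while a lower IFSR indicates better detection of physically infeasible patterns; the paper's exact definitions may differ.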
Submission Number: 15