\section{Related Work}

\subsection{Spatial Reasoning Benchmarks}
Evaluating the spatial reasoning abilities of MLLMs is crucial for deploying them in real-world scenarios, but existing benchmarks fall short in several respects~\cite{li2025benchmark, zhou2022vlue}. CLEVR~\cite{johnson2016clevrdiagnosticdatasetcompositional} and Visual Genome~\cite{krishna2016visualgenomeconnectinglanguage} focus on static scene understanding or single-step reasoning and often operate in synthetic environments, so they struggle to reflect the complexity of the real world. NLVR2~\cite{suhr2019nlvr2visualbiasanalysis} concentrates on comparative reasoning over image pairs but cannot readily measure a model's ability to understand and execute tasks involving multiple spatial state transitions. StepGame~\cite{shi2022stepgamenewbenchmarkrobust} and LEGO-Puzzles~\cite{tang2025lego} explore multi-step processes, but they are either limited to text-only models or do not sufficiently emphasize precise geometric and physical constraints. Interaction with the environment and understanding of physical manipulation are further weak points of current evaluation methods: many benchmarks rely primarily on static inputs and rarely include tasks that require models to predict or guide a sequence of physical actions.
To address these challenges, we propose \dataset. By introducing origami, a structured and complex multi-step physical task, \dataset{} directly targets the shortcomings of existing benchmarks. It leverages the precise geometric constraints and ordered sequence of operations inherent to origami to comprehensively evaluate the capabilities of MLLMs in complex, dynamic spatial reasoning.

\subsection{Computational Origami}

Computational origami is an emerging field within computer science that studies algorithms for origami-related problems~\cite{demaine2002recent, lang1996computational}. It covers two main aspects: origami design~\cite{lang1996computational} and origami foldability~\cite{tachi2017self, li2019architected}. Origami design develops algorithms that generate crease patterns with specific shapes or functionalities~\cite{silverberg2014using}. Origami foldability, in contrast, asks whether a given crease pattern can be folded into a particular shape, with flat-foldability as a central case~\cite{tachi2009generalization}.
Our work does not focus on designing new origami models; instead, we leverage the characteristics of origami to evaluate the spatial reasoning abilities of MLLMs. Drawing on computational origami knowledge of crease patterns, folding processes, and mathematical principles, we optimize an existing origami compilation system and the evaluation functions for crease patterns. The result is a benchmark that tests the capabilities of MLLMs in multi-step spatial manipulation and constraint satisfaction.