\section{Introduction}
Spatial reasoning is a core component of artificial intelligence~\cite{chen2024spatialvlm, stogiannidis2025mind}, with wide applications in robotics~\cite{song2024robospatial}, autonomous driving~\cite{yang2025lidar}, and geographic information systems~\cite{aliman2024developing}. Although multimodal large language models (MLLMs) demonstrate outstanding performance in various vision and language tasks~\cite{zhang2024mm, caffagni2024revolution}, they face challenges in imagining spatial transformations and grasping spatial relationships in image and text spaces. Evaluating their spatial reasoning ability has become an important task.

Multi-step reasoning and constraints are critical yet underexplored areas in spatial intelligence. Current spatial reasoning benchmarks typically focus on understanding static images or simple scenes~\cite{li2025benchmark}. Some studies compare and reason about spatial relationships between image pairs, but pay little attention to continuous spatial transformations~\cite{johnson2016clevrdiagnosticdatasetcompositional, li2023superclevrvirtualbenchmarkdiagnose}. Others propose multi-step spatial reasoning but involve no interaction with the environment and lack the constraints found in real-world tasks~\cite{tang2025lego}. These limitations indicate the need for a new benchmark that evaluates the capabilities of MLLMs in complex spatial reasoning scenarios more comprehensively.

Origami art offers an ideal platform for evaluating complex spatial reasoning abilities~\cite{misseroni2024origami}. 
Origami involves a sequence of ordered folding operations, where each step depends on the result of the previous one, embodying the essence of multi-step reasoning. 
Furthermore, the origami process is governed by explicit geometric constraints: folds must occur along straight lines, and the paper cannot be torn or separated. All origami operations are also defined by strict mathematical constraints (\textit{Kawasaki's Theorem}, the \textit{Huzita-Hatori axioms}, etc.)~\cite{carberry2004kawasaki, kasem2011origami}. 
The transformation from a two-dimensional crease pattern (CP diagram) through multiple folding steps to a three-dimensional folded shape image requires strong spatial imagination and reasoning abilities.
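
As a concrete illustration of such a constraint, Kawasaki's Theorem can be stated for a single interior vertex of a crease pattern as follows. This is a standard formulation given here for reference; the symbols $\alpha_i$ and $n$ are notation assumed for this illustration only. If $2n$ creases meet at the vertex and $\alpha_1, \dots, \alpha_{2n}$ denote the consecutive angles between them, then the vertex can be folded flat if and only if
\[
\alpha_1 - \alpha_2 + \alpha_3 - \cdots - \alpha_{2n} = 0,
\]
or equivalently
\[
\alpha_1 + \alpha_3 + \cdots + \alpha_{2n-1} = \alpha_2 + \alpha_4 + \cdots + \alpha_{2n} = \pi .
\]
For example, four creases with consecutive angles $90^\circ, 45^\circ, 90^\circ, 135^\circ$ satisfy the condition, since $90^\circ + 90^\circ = 45^\circ + 135^\circ = 180^\circ$.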

\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.8\linewidth]{figs/front.pdf}
    \caption{An example data instance from \dataset, consisting of a CP Diagram, a Compiled Flat Pattern, a Folded Shape Image, and a Folding Process. The CP Diagram can also be represented as CP Code.}
    \label{fig:case}
\end{figure}


To bridge this gap in existing benchmarks, this paper introduces the \dataset dataset and benchmark. The dataset contains 350 meticulously collected origami data instances, each including a CP diagram, its corresponding compiled flat pattern, illustrations of the complete folding process, and the final folded shape. The data are diverse and complex, covering various origami types.
We improve the existing origami compiler, enabling it to output detailed flattened diagrams that include crease locations and stacking relationships, support interactive simulation with MLLMs, and provide more comprehensive error feedback.
Based on this dataset, we design four challenging evaluation tasks: pattern prediction, spatial relationship prediction, multi-step spatial reasoning, and end-to-end CP code generation, which together comprise 1,500 multiple-choice questions and 120 code generation questions.
For the code generation task, we meticulously design a comprehensive evaluation strategy to measure the quality of the generated CP code across multiple dimensions.

The core advantages of \dataset lie in its authenticity (derived from real origami designs), multi-step reasoning characteristics (reflecting the inherent process of origami), and rigorous mathematical constraints (precisely verifiable through origami theorems). We evaluate the performance of various MLLMs on \dataset and introduce environmental learning and reinforcement learning methods for the code generation task, opening up new perspectives and effective avenues for assessing and enhancing the spatial reasoning abilities of MLLMs.

The main contributions of this paper include:
\begin{itemize}[topsep= 1ex,leftmargin=2.5\labelsep]
\item We introduce \dataset, a dataset containing 350 high-quality origami data instances, and optimize the existing origami compiler, enabling it to provide more comprehensive feedback.
\item We design four challenging tasks centered around spatial reasoning, comprising 1,500 multiple-choice questions and 120 CP code generation questions. Together they form the first benchmark to evaluate the multi-step spatial reasoning ability of MLLMs under mathematical constraints.
\item We conduct a comprehensive evaluation of existing MLLMs, develop a complete interactive environment for the end-to-end CP code generation task, and use this environment to explore environmental learning and reinforcement learning methods.
\end{itemize}