\begin{figure}[htbp]
    \centering
    \includegraphics[width=1\linewidth]{figs/data_6.pdf}
    \caption{Data examples of the four tasks. The first three tasks are in a multiple-choice format, and the fourth task is a code generation task.}
    \label{fig:data}
\end{figure}
\section{Task}
Based on \dataset, we propose four tasks to comprehensively evaluate the spatial reasoning capabilities of MLLMs.

\subsection{Pattern Prediction}
% The input for this task is the CP diagram. MLLMs need to predict the folded shape based on the CP diagram. This task evaluates the model's ability to understand the folding process from the CP diagram and imagine the final 3D shape, as shown in Figure 1. To better quantify the results, we design it as a multiple-choice question. The correct option is the name of the shape itself. For the incorrect options, we invite three origami enthusiasts to design three incorrect options for each diagram, requiring that the incorrect options are easily distinguishable from the correct option, are not variations of the same concept (for example, if the correct option is a cat, the incorrect options are not lions, leopards, etc.), and are close to potential folded states based on the CP diagram (for example, if a few key creases are removed, a boat's CP diagram becomes similar to a hat). We design 350 questions. See Appendix \ref{{app:human_1}} for the specific annotation rules.
This task evaluates the model's ability to understand the folding process from the CP diagram and imagine the final 3D shape. The input is the CP diagram, and MLLMs are required to predict the shape it folds into.
To enable better quantitative evaluation, we structure this task as a multiple-choice question. 
The correct option is the name of the target shape. 
For the incorrect options, three origami enthusiasts design three incorrect options for each diagram. Each incorrect option must (i) be easily distinguishable from the correct option; (ii) not be a variation of the same concept (e.g., if the correct option is a cat, the incorrect options are not lions, leopards, etc.); and (iii) remain close to a potential folded state of the CP diagram (e.g., removing a few key creases makes a boat's CP diagram similar to that of a hat).
We create 350 questions for this task. See Appendix \ref{app:human_1} for the specific annotation rules.

\subsection{Multi-step Spatial Reasoning}
This task evaluates the model's ability to understand the dynamic origami process and the logical relationships between steps. The input is a set of images that together show several key steps of a complete origami process, presented in randomly shuffled order. MLLMs must infer the correct chronological order of these steps from the geometric state changes visible in the images. To better quantify the model's performance, we structure this task as a multiple-choice question. The correct option is the unique sequence of steps that represents the correct folding process (for example, ``1-2-3-4''). For the incorrect options, we generate multiple logically incorrect sequences (for example, ``1-2-4-3'' or ``4-1-2-3''). These sequences may preserve partially correct local orders but contain errors in the overall flow, which tests the model's grasp of the complete, coherent process. We design 250 such questions, with an average of 7.5 steps per question.

\subsection{Spatial Relationship Prediction}
% The input for this task is the CP diagram. The model predicts the spatial relationships between specific parts of the origami model after the final fold is complete. As shown in Figure 1, there are three types of questions, which are: 1) Spatial Pose Localization, i.e., determining the specific position of a specific point on the original paper in the final 3D model, and also considering the model's pose within a specific reference frame (e.g., on a table, facing upwards); 2) Layering Relationship Analysis, i.e., determining the paper stacking after folding, which requires tracking the paper's covering relationships during the folding process and determining how many layers of paper form a specific region (e.g., the thickest region); and 3) Geometric Change Analysis, i.e., predicting the change in specific geometric features (such as angles, distances, areas, etc.) during the folding process. For example, the model determines the relative angle or spatial distance between two line segments on the CP diagram after folding is complete. The correct answers for all three types of questions can be obtained using our optimized compilation program, and incorrect answers are then manually designed. We design a total of 900 multiple-choice questions (300 for each type). See Appendix A for specific annotation rules.
This task evaluates the model's ability to predict spatial relationships and geometric properties after the folding process is complete. For this task, the input is the CP diagram. The model is required to predict specific spatial relationships between parts of the origami model after it is fully folded. The task comprises three types of multiple-choice questions designed to test this ability:
1) \textbf{Spatial Pose Localization}: Determining the specific 3D position of a point from the original paper in the final model, including its pose within a reference frame (e.g., on a table, facing upwards).
2) \textbf{Layering Relationship Analysis}: Determining the paper stacking order after folding, requiring analysis of covering relationships during the folding process and identifying how many paper layers form a specific region (e.g., the thickest region).
3) \textbf{Geometric Change Analysis}: Predicting how specific geometric features (such as angles, distances, areas, etc.) change from the flat CP diagram to the final folded state. For example, predicting the relative angle or spatial distance between two original line segments after folding.

The correct answers for all three question types are obtained using our optimized compiler. Incorrect options are then manually designed. We design 900 multiple-choice questions (300 for each type) for this task. See Appendix \ref{app:human_2} for specific annotation rules.


% \subsection{End-to-End CP Code Generation}

% This task requires MLLMs to generate the corresponding CP code based on the compiled flat map and the folded shape image. This CP code should be compilable by our compilation program into a folded pattern identical to the target shape. To comprehensively evaluate the quality of the generated results, we design a multi-dimensional evaluation framework. The specific process is as follows:

% \textbf{Structure Validity Check}
% First, a basic structural validity check is performed on the CP code generated by the model. This step verifies the existence and format compliance of core data structures (such as vertex coordinates, edge definitions, face definitions), ensuring that vertex coordinates, edge definitions (validity of vertex indices), and face definitions (at least 3 vertices and valid indices) all meet the requirements. Simultaneously, it checks whether the crease assignments use predefined valid characters (such as 'B', 'M', 'V', 'F', 'U'). A crucial step is to verify whether it satisfies the Euler characteristic for planar graphs:
% $$ V - E + F = 2 $$
% where $V$, $E$, and $F$ represent the number of vertices, edges, and faces, respectively. Only CP code that passes all structural verifications can proceed to subsequent evaluation. If verification fails, the final score for that generated result is 0, and the reason for failure is recorded.

% \textbf{Compilation and Similarity Evaluation}
% Following structural verification, the system attempts to compile the generated CP code to obtain its corresponding (theoretical) folded state.
% If the generated CP code \textit{compilation fails} because of inner problems or because it does not follow origami geometry rules, it means it has serious flaws. In this case, even though some comparison of the topological structure might still happen, its score for "Constraint Satisfaction" and "Final Folded State" will be very low, greatly reducing the final total score.
% If \textit{compilation is successful}, we will compare the generated CP code and its compiled folded model with a reference CP code (or a reference model taken from the input 3D model). We will compare them in several ways:

% \textbf{1) Topological Structure Similarity}: 
% This dimension evaluates similarity at the graph-theoretic level. It compares their number of vertices (score $ s_v = e^{-0.5 \frac{|V_{gen} - V_{ref}|}{\min(V_{gen}, V_{ref})}} $), edge connectivity (e.g., degree distribution similarity, number of connected components), face relationships (e.g., number of faces, face size distribution), and the distribution similarity of crease types ('M', 'V', 'B', etc.).

% \textbf{2) Geometric Similarity}: 
% This dimension focuses on the spatial characteristics of the compiled model. It evaluates point position similarity (score $ s_p = e^{-k \cdot d_H} $, where $k$ is the sensitivity coefficient, e.g., 5) by calculating the bidirectional Hausdorff distance $d_H$ of normalized 3D point sets. It evaluates angle similarity by comparing the dihedral angle distribution at creases, and evaluates size and proportion similarity by comparing the aspect ratios of the overall bounding box of the model.

% \textbf{3) Constraint Satisfaction}: 
% This dimension evaluates whether the generated CP code complies with the physical and mathematical constraints of origami. This includes comparing the presence and matching degree of key constraint types (Taco-Taco, Taco-Tortilla, Transitivity constraints) and checking for satisfaction of basic theorems of locally flat foldability, such as Maekawa's Theorem (the difference between the number of mountain folds M and valley folds V around a vertex is $ |M-V|=2 $ ) and Kawasaki's Theorem (the sum of crease angles $\alpha_i$ around a vertex is $ \sum \alpha_i = 2\pi $ or 0).


% \textbf{4) Final Folded State}: 
% This dimension directly compares the final compiled 3D model shape. It primarily evaluates overall shape similarity by calculating the Hausdorff distance of point sets and, where possible (if the model provides layering information), compares the layering relationships between facets.

% \textbf{Overall Score:}
% Finally, the system computes the final overall score $S_{total}$. This score is the weighted average of the scores $s_{dim}$ obtained from each evaluation dimension, using preset weights $w_{dim}$. By default, each dimension is weighted at 25\%. The score is calculated using the formula $ S_{total} = \sum_{dim} w_{dim} \cdot s_{dim} $, where $ \sum w_{dim} = 1 $. Ranging from 0 to 1 ($S_{total} \in [0, 1]$), this score reflects the overall quality of the generated CP code in terms of structural validity, topological accuracy, geometric consistency, foldability, and final shape matching. See Appendix A for more details on the evaluation process.

\subsection{End-to-End CP Code Generation}
\label{code_eval}
This task requires the MLLM to generate the corresponding CP code from a compiled flat layout and an image of the folded shape. Ideally, this CP code should compile into a folded pattern identical to the target shape. To comprehensively evaluate the quality of the generated results, we design a multidimensional evaluation framework.

\textbf{Compilation Attempt and Evaluation} The CP code generated by the model is first compiled with our origami compiler (see Section \ref{cp} for details). If the \textit{compilation fails}, the compiler returns one or more error types. If the \textit{compilation succeeds}, meaning that the CP code is syntactically valid, geometrically foldable, free of self-intersections, and produces a definite folded state, we compare the compilation result with the reference result along the following four dimensions:

\textbf{1) Topological Structure Similarity (TSS)} This dimension evaluates similarity at the graph-theoretic level by comparing the compiled outputs. It compares the number of vertices of successfully compiled patterns (score $s_v = e^{-0.5 \frac{|V_{gen} - V_{ref}|}{\min(V_{gen}, V_{ref})}}$), edge connectivity (e.g., similarity of degree distribution, number of connected components), face relationships (e.g., number of faces, distribution of face sizes), and the distribution similarity of crease types (``M'', ``V'', ``B'', etc.).
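
As an illustration with hypothetical counts (not drawn from our data), suppose a generated pattern has $V_{gen}=18$ vertices and the reference has $V_{ref}=20$. The vertex-count term then evaluates to
\[
s_v = e^{-0.5 \cdot \frac{|18-20|}{\min(18,\,20)}} = e^{-0.5 \cdot \frac{2}{18}} \approx 0.946,
\]
so a small mismatch in vertex count costs little. By contrast, a generated pattern with twice the reference count ($V_{gen}=40$, $V_{ref}=20$) gives $s_v = e^{-0.5 \cdot \frac{20}{20}} = e^{-0.5} \approx 0.607$.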

\textbf{2) Geometric Similarity (GS)} This dimension focuses on the spatial characteristics of the compiled model. It evaluates point position similarity by calculating the bidirectional Hausdorff distance $d_H$ between the normalized 3D point sets of the generated and reference compiled models (score $s_p = e^{-k \cdot d_H}$, where $k$ is a sensitivity coefficient, e.g., 5). It assesses angular similarity by comparing the distribution of dihedral angles at the creases, and evaluates size and proportion similarity by comparing the aspect ratios of the overall bounding boxes of the models.
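
For reference, one standard form of the bidirectional Hausdorff distance between two finite point sets $P$ and $Q$ is given below. The symbols $P$, $Q$, $p$, $q$ and the choice of the Euclidean norm are assumptions made here for illustration.
\[
d_H(P, Q) = \max\Big\{ \max_{p \in P} \min_{q \in Q} \|p - q\|,\ \max_{q \in Q} \min_{p \in P} \|p - q\| \Big\}.
\]
As a hypothetical example, with $k=5$ and $d_H = 0.1$ on normalized coordinates, $s_p = e^{-5 \cdot 0.1} = e^{-0.5} \approx 0.607$, while identical point sets ($d_H = 0$) give $s_p = 1$.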

\textbf{3) Constraint Satisfaction (CS)} This dimension evaluates whether the successfully compiled CP code, beyond the basic foldability ensured by the compiler, further adheres to the physical and mathematical constraints of origami. This includes comparing the presence and matching degree of critical constraint types (Taco-Taco, Taco-Tortilla, transitivity constraints) and checking for satisfaction of fundamental theorems of local flat-foldability, such as Maekawa's theorem (the numbers of mountain creases $M$ and valley creases $V$ around a vertex differ by two, $|M-V|=2$) and Kawasaki's theorem (the consecutive sector angles $\alpha_i$ between creases around a vertex sum to $2\pi$, and their alternating sum vanishes, $\sum_i (-1)^i \alpha_i = 0$).
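
As a sanity check on these two conditions, consider a hypothetical degree-4 vertex (not taken from our data) whose consecutive sector angles are $\alpha_1 = 90^\circ$, $\alpha_2 = 45^\circ$, $\alpha_3 = 90^\circ$, and $\alpha_4 = 135^\circ$, with three mountain creases and one valley crease. Reading Kawasaki's condition as a vanishing alternating sum of the sector angles, we have
\[
\alpha_1 - \alpha_2 + \alpha_3 - \alpha_4 = 90^\circ - 45^\circ + 90^\circ - 135^\circ = 0, \qquad \sum_i \alpha_i = 360^\circ \ (= 2\pi),
\]
and Maekawa's condition holds since $|M - V| = |3 - 1| = 2$. Replacing the angles with $100^\circ$, $45^\circ$, $90^\circ$, $125^\circ$ keeps the total at $360^\circ$ but gives an alternating sum of $100^\circ - 45^\circ + 90^\circ - 125^\circ = 20^\circ \neq 0$, so such a vertex would not be locally flat-foldable.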

\textbf{4) Final Folded State (FFS)} This dimension directly compares the final 3D model shape compiled from the generated CP with the reference compiled 3D model. It primarily evaluates overall shape similarity by calculating the Hausdorff distance of the point sets, and where possible (if the model provides layering information), compares the layering relationships between facets, including paper stacking order information that may be obtained during the compilation process.

\textbf{Total Score:} The final total score $S_{total}$ is a weighted average of the scores $s_{dim}$ from each evaluation dimension: $S_{total} = \sum_{dim} w_{dim} \cdot s_{dim}$. By default, each of the four dimensions accounts for 25\% of the weight ($w_{dim}=0.25$), and $\sum w_{dim} = 1$. This score ranges from 0 to 1 ($S_{total} \in [0,1]$), reflecting the overall quality of the generated CP code. For more details on the evaluation process, please refer to Appendix \ref{app:eval_2}.
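
For instance, take hypothetical dimension scores $s_{TSS}=0.9$, $s_{GS}=0.6$, $s_{CS}=0.8$, and $s_{FFS}=0.7$. These are illustrative values, not experimental results, and the subscripts simply reuse the dimension abbreviations. With the default weights, the total score is
\[
S_{total} = 0.25 \cdot (0.9 + 0.6 + 0.8 + 0.7) = 0.25 \cdot 3.0 = 0.75.
\]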