\section{Experiments}
% \input{tables/rl}
\subsection{Models}
We evaluate a range of representative MLLMs. For open-source models, we
evaluate MiniCPM-o 2.6~\cite{yao2024minicpm}, NVILA-15B~\cite{liu2024nvila}, LLaVA-1.5-7B~\cite{li2024llavaonevisioneasyvisualtask},
VideoLLaMA3~\cite{damonlpsg2024videollama2}, Qwen2.5-VL-[7B/32B/72B]~\cite{bai2025qwen25vltechnicalreport}, DeepSeek-VL2~\cite{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels}, and InternVL2.5-78B~\cite{chen2025expandingperformanceboundariesopensource}. For proprietary models, we evaluate Claude-3.5-Sonnet~\cite{anthropic_claude3.5sonnet_2024}, GPT-4o~\cite{openai_gpt4o_2024}, and Gemini 2.5-[flash/pro]~\cite{google_gemini1.5flash_2024}.
For all models, we use the original models with their official instruction formats.
\input{tables/main_1}

\subsection{Baseline}
We recruit two groups of participants to complete the first three tasks. The first group consists of five laypersons recruited via a crowdsourcing platform, and the second comprises three experts with extensive origami experience. Details of the human evaluation are provided in Appendix~\ref{app:human_3}. For the CP code generation task, we adopt the following settings:
% We recruit two categories of participants to complete the first three tasks. The first category consists of five laypersons recruited via a crowdsourcing platform, and the second category comprises three experts with extensive origami experience. Specific details of the human evaluation are provided in Appendix \ref{app:human_3}. For the CP code generation task, we adopt the following settings:

\textbf{In-context learning}
In this setting, we provide the model with detailed task instructions and a set of CP code examples. The instructions explain the meaning of each part of the CP code and list all constraints that must be satisfied. The MLLM must generate the complete CP code in a single pass based on these instructions and examples.
% The specific prompt is shown in Appendix \ref{prompt}.

\textbf{Environmental learning}
In this setting, the MLLM no longer attempts to generate the complete CP code in a single pass but instead interacts iteratively with the compiler. Specifically, the MLLM first plans and then generates CP code. The compiler returns its compilation result, the model reasons over this feedback and chooses to add or delete creases, and the process repeats. We cap the number of interaction rounds at 10.
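The loop can be sketched in symbols; the notation is introduced here purely for illustration and is not part of the benchmark definition. Let $M$ denote the MLLM, $x$ the task input, $c_t$ the CP code after round $t$, and $f_t$ the compiler feedback on $c_t$. Assuming for simplicity that the model conditions only on the latest code and feedback (the actual prompt may carry more history), the interaction reads
\[
c_1 = M(x), \qquad f_t = \mathrm{Compile}(c_t), \qquad c_{t+1} = M(x, c_t, f_t), \qquad 1 \le t < T, \quad T \le 10,
\]
where $c_1$ is produced after the planning step and each update from $c_t$ to $c_{t+1}$ adds or deletes creases.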


\textbf{Reinforcement learning}
Using the constructed compilation environment, we also explore a reinforcement learning approach. We train on the 471 sets of data mentioned in Section~\ref{data_collection}, sampling data through the same interaction process as in environmental learning. The reward mechanism is as follows. (1) Intermediate reward: after each code modification, if compilation succeeds, the model receives a reward based on the quality improvement of the current partial CP code ($S_{partial} - S_{partial\_prev}$, where $S_{partial}$ is a quickly evaluated partial quality score and $S_{partial\_prev}$ is its value before the modification), plus a small base reward for successful compilation; if compilation fails, it receives a fixed negative penalty. (2) Step penalty: each action incurs a small negative reward to encourage efficiency. (3) Final reward: after the interaction ends, the result of the evaluation function defined in Section~\ref{code_eval} serves as the main reward. We train Qwen2.5-VL-32B with TRICO~\cite{VAGEN}, a PPO-based~\cite{schulman2017proximal}, more efficient multi-turn reinforcement learning algorithm for MLLMs. Training settings and hyperparameters are given in Appendix~\ref{train}.
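The per-step part of this reward can be written compactly. The symbols $b$, $p$, and $\lambda$ (all positive) and the step index $t$ are introduced here for illustration only; their values are not specified in this section:
\[
r_t = \left\{ \begin{array}{ll}
(S_{partial} - S_{partial\_prev}) + b - \lambda, & \mbox{if compilation succeeds,} \\
-p - \lambda, & \mbox{if compilation fails,}
\end{array} \right.
\]
where $b$ is the small compilation success reward, $p$ is the magnitude of the fixed failure penalty, and $\lambda$ is the per-action step penalty. Under the assumption that the terms are simply summed, the final reward from the evaluation function in Section~\ref{code_eval} is granted once, on top of the accumulated $r_t$, when the interaction ends.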


\input{tables/main_2}

\subsection{Main Results}
Tasks 1 to 3 primarily focus on spatial analysis and prediction. Table~\ref{main_1} reports results averaged over three runs for each MLLM, from which we observe the following:
1) \dataset is challenging for MLLMs; the weakest models perform close to random guessing (25\%), and even the best-performing models fall substantially short of human performance, especially in multi-step spatial reasoning.
2) Despite the different task types, the relative ranking of the models remains largely consistent, with Gemini 2.5-pro and GPT-4o demonstrating the best spatial reasoning ability.
3) Human experts perform well on all tasks, providing a reference upper bound for the tasks.
4) MLLMs perform worst on the Spatial Relationship Prediction task, especially on the sub-tasks involving Geometric Change, indicating that models have considerable difficulty understanding fine-grained, internal spatial structures.

Table~\ref{main_2} presents the results of different methods and models on Task 4. We observe the following:
1) Impact of learning settings: the learning setting has a large effect on performance. In-context learning yields relatively limited performance. Environmental learning brings substantial improvements, showing that iterative interaction with the compiler, combining planning with feedback-driven trial and error, lets models overcome the limitations of single-pass generation. Reinforcement learning shows promise, as the trained Qwen2.5-VL-32B surpasses the performance of a 72B model.
2) Model differences: performance varies considerably across models, with top-tier closed-source models exhibiting the best spatial reasoning capabilities.

\begin{figure}[htbp]
    \centering
    \includegraphics[width=1\linewidth]{figs/a.pdf}
    \caption{The impact of interaction rounds on the compilation pass rate and total score of different models.}
    \label{fig:d}
\end{figure}

\subsection{Impact of Mathematical Constraints}
Mathematical constraints are a primary challenge in generating valid CP code for the \dataset task. Table~\ref{main_2} indicates that unsatisfied constraints are the main cause of compilation failures; even when given detailed instructions, models struggle to adhere strictly to these complex rules, leading to persistently high compilation failure rates. Interaction with the environment improves models' ability to follow constraints, suggesting that models can learn and internalize rules from feedback. Compared to environmental learning, reinforcement learning further improves constraint satisfaction, supporting the effectiveness of the specific reward mechanisms. However, even with interactive learning, precisely satisfying all mathematical constraints remains a significant challenge for top-tier models (such as GPT-4o and Gemini 2.5-pro, whose \textit{constraint satisfaction} score is only 56.99\% under the environmental learning setting). This reveals MLLMs' deficiencies in deep multi-step geometric and layering reasoning and highlights the value of the fine-grained feedback and constraint satisfaction evaluation introduced in this study.

\subsection{Impact of Interaction Rounds in Environmental Learning}
Figure~\ref{fig:d} illustrates the impact of the number of interaction rounds on model performance across different dimensions under the environmental learning setting. As the number of rounds increases, performance improves in several aspects, most notably the compilation pass rate. However, performance tends to saturate after 8--10 rounds, indicating that interaction mainly helps models overcome initial obstacles but does not break through their inherent bottlenecks. Weaker models, limited by their understanding capabilities, reach their upper limit in fewer rounds. The reinforcement learning-trained Qwen2.5-VL-32B follows a similar trend but, owing to policy optimization, may reach its performance ceiling in fewer rounds.

