\section{Limitations}
\label{lim}
While the \dataset benchmark and dataset offer a novel approach to evaluating multi-step spatial reasoning in MLLMs, we acknowledge several limitations that point to avenues for future work.
First, although our dataset comprises 350 meticulously collected origami instances, its scale is modest compared with large-scale benchmarks in other vision and language domains. Future work could expand the dataset and diversify the origami types and complexity levels it covers, potentially through semi-automated generation techniques, to broaden coverage and increase statistical power.
Second, while origami provides a structured environment with clear mathematical constraints, it remains an open question whether MLLM performance and the specific reasoning mechanisms learned on \dataset transfer to other, less constrained or visually distinct spatial reasoning tasks (e.g., understanding dynamic real-world scenes or interpreting abstract diagrams from other fields). Exploring this generalization gap is a valuable direction for future research.
Finally, our current evaluation tasks, though designed to be challenging, focus on the specific facets of spatial reasoning that origami highlights. Other aspects of spatial intelligence, or different modalities of interaction with the origami compilation process, could be explored in future iterations to provide a more holistic assessment of MLLM capabilities.