EucliFold: Probing 3D Euclidean Prior in VLMs via Cognitively-Stratified Folding Tasks

ICLR 2026 Conference Submission25372 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: vision language model, synthetic dataset

Abstract: Humans leverage robust 3D spatial priors to align perception with the physical world, enabling flexible and intelligent behavior. While Vision-Language Models (VLMs) exhibit impressive zero-shot performance, it remains unclear whether they possess genuine spatial reasoning capabilities, as standard evaluations are confounded by dataset bias and spurious correlations. To address this, we introduce **EucliFold**, a synthetic visual question-answering benchmark focused on cube net folding in Euclidean space—a domain that enables precise analysis while requiring genuine spatial understanding. We propose a **cognitively-stratified evaluation framework** that decomposes spatial reasoning into three hierarchical levels: **Perception** (grounding sensory input to spatial representations), **Operation** (manipulating representations according to instructions), and **Imagination** (autonomous spatial problem-solving under geometric constraints). This decomposition isolates genuine spatial reasoning from superficial pattern matching. To mitigate evaluation biases, we employ **Winograd-style accuracy** using minimal-pair contrastive samples. Our evaluation reveals that state-of-the-art VLMs demonstrate reasonable perceptual capabilities but fail significantly at operational and imagination-level spatial reasoning, suggesting reliance on statistical patterns rather than genuine geometric understanding. Ablation studies confirm the effectiveness of our cognitively-stratified decomposition and bias-resistant evaluation methodology. EucliFold provides a rigorous testbed for probing emergent spatial priors in future models and demonstrates how systematic cognitive decomposition can reveal nuanced capability gaps in VLMs.
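The abstract does not spell out how Winograd-style accuracy is computed. A minimal sketch of one common reading follows: a minimal pair counts as solved only if the model answers both of its contrastive members correctly. That scoring rule, the function names, and the `(pair_id, predicted, gold)` record layout are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def winograd_style_accuracy(records):
    """Fraction of minimal pairs whose members are ALL answered correctly.

    `records` is an iterable of (pair_id, predicted, gold) tuples.
    ASSUMPTION: this all-or-nothing pair rule is a guess at the metric,
    not the definition used in the paper.
    """
    pairs = defaultdict(list)
    for pair_id, predicted, gold in records:
        pairs[pair_id].append(predicted == gold)
    if not pairs:
        raise ValueError("no records to score")
    solved = sum(all(flags) for flags in pairs.values())
    return solved / len(pairs)


def per_sample_accuracy(records):
    """Ordinary accuracy over individual samples, for comparison."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    return sum(pred == gold for _, pred, gold in records) / len(records)


# Hypothetical toy data: two minimal pairs of yes/no folding questions.
records = [
    ("net-01", "yes", "yes"),
    ("net-01", "yes", "no"),   # the model answers "yes" to both members
    ("net-02", "no", "no"),
    ("net-02", "yes", "yes"),
]

print(per_sample_accuracy(records))      # 0.75
print(winograd_style_accuracy(records))  # 0.5
```

On this toy data a model that always answers "yes" on the first pair still gets one of its two members right, so per-sample accuracy is 0.75, while the paired score drops to 0.5. This is the kind of answer bias that minimal-pair scoring is meant to expose.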

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 25372
