Keywords: Spatial Intelligence, Unified MLLM
Abstract: While Unified Multimodal Models (UMMs) show remarkable reasoning capabilities, their spatial intelligence remains limited to passive 2D Question-Answering (QA). In this paper, we argue that true spatial intelligence demands active construction: not only recognizing a 3D structure in pixel space, but also building and modifying it. We introduce MindBlock to evaluate models' active generative construction in pixel space along two primary axes: \textbf{Spatial Assembly}, evaluating step-by-step compositional and causal reasoning, and \textbf{Spatial Structure}, probing spatial equivariance through local sub-component rotation and global viewpoint transformation.
To move beyond pixel-level metrics, we propose 3DGS-Eval, a novel validation protocol that uses 3D Gaussian Splatting to reconstruct implicit scenes from model-generated multi-view images. This allows us to quantify structural consistency, verifying for the first time whether a model's generative output reflects a coherent internal 3D world model. Furthermore, we conduct a deep-dive diagnostic analysis of the representational grounding of spatial logic, disentangling whether structural consistency relies on textual Chain-of-Thought (CoT) as a symbolic scaffold, or emerges as a native spatial intuition within the generative latent space. Our findings reveal a significant ``perception-execution'' gap: current models correctly identify the intended spatial state yet fail in active construction, and they struggle to maintain spatial equivariance without explicit symbolic scaffolding. MindBlock provides a rigorous foundation for the next generation of embodied, physically-grounded multimodal AI.
Previously Accepted: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 29