Keywords: Spatial Intelligence, Unified MLLM
Abstract: While Unified Multimodal Models (UMMs) show remarkable reasoning capabilities, their spatial intelligence remains limited to passive 2D Question-Answering (QA). In this paper, we argue that true spatial intelligence demands active construction: not only recognizing a 3D structure in pixel space, but also building and modifying it. We introduce MindBlock to evaluate models' active generative construction in pixel space along two primary axes: \textbf{Spatial Assembly}, evaluating step-by-step compositional and causal reasoning, and \textbf{Spatial Structure}, probing spatial equivariance through local sub-component rotation and global viewpoint transformation.
To move beyond pixel-level metrics, we propose 3DGS-Eval, a novel validation protocol that uses 3D Gaussian Splatting to reconstruct implicit scenes from model-generated multi-view images. This allows us to quantify structural consistency, verifying for the first time whether a model's generative output reflects a coherent internal 3D world model. Furthermore, we conduct a deep-dive diagnostic analysis of the representational grounding of spatial logic, disentangling whether structural consistency relies on textual Chain-of-Thought (CoT) as a symbolic scaffold, or emerges as a native spatial intuition within the generative latent space. Our findings reveal a significant ``perception-execution'' gap: current models correctly identify the intended spatial state yet fail in active construction, and they struggle to maintain spatial equivariance without explicit symbolic scaffolding. MindBlock provides a rigorous foundation for the next generation of embodied, physically-grounded multimodal AI.
Previously Accepted: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 29