SpatialTree: Branching Out Spatial Intelligence in MLLMs via a Capability Tree

12 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Spatial Intelligence; MLLMs; Capability Taxonomy
Abstract: Spatial Intelligence (SI) is rapidly becoming a cornerstone capability for MLLMs, enabling them to seamlessly perceive, reason about, and interact with complex 3D environments, a critical step towards truly embodied AI systems. However, prior work typically focuses on a few specific 3D tasks, offering only a fragmented view of MLLMs' spatial abilities. Inspired by cognitive science studies, we propose SpatialTree, a hierarchical taxonomy that organizes SI into a capability tree spanning low-level perception (L1), mental mapping (L2), mental simulation (L3), and high-level agentic competence (L4). Building on this taxonomy, we introduce the first capability-centric benchmark that thoroughly evaluates the spatial abilities of MLLMs. Moreover, we conduct extensive experiments to investigate the compositional nature of spatial abilities, examining the dependencies among abilities and identifying the atomic abilities that exert the greatest influence on others. Furthermore, we introduce SpatialEngine, an extensible framework that integrates 3D vision perception models with MLLMs into a progressive annotator, enabling comprehensive data annotation across the entire tree.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4285