SCoT: Teaching 3D-LLMs to Think Spatially with Million-scale CoT Annotations

ICLR 2026 Conference Submission 8925 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026 · ICLR 2026 · CC BY 4.0
Keywords: 3D Large Language Model, Chain-of-Thought, Spatial Perception, Spatial Analysis, Spatial Planning
TL;DR: We present a million-scale 3D visual-language dataset with CoT annotations that unifies perception, analysis, and planning tasks to advance interpretable 3D intelligence.
Abstract: Recent advances in 3D Large Language Models (3D-LLMs) show strong potential for understanding and interacting with 3D environments, yet their training data typically lack explicit reasoning processes, limiting complex spatial reasoning and task planning. To address this, we introduce SCoT, a million-scale Chain-of-Thought (CoT) dataset spanning three levels: a) Spatial Perception (what is there), recognizing object properties, relations, and scene attributes; b) Spatial Analysis (what does it mean), inferring rationality, functionalities, and physical implications; c) Spatial Planning (what should I do), integrating perception and reasoning into actionable strategies. Unlike prior datasets that supervise only final answers, SCoT annotates intermediate reasoning grounded in scene cues, specifically for analysis and planning tasks. Results show that CoT supervision greatly benefits complex analysis and planning but induces hallucinations and accuracy drops on simple perception tasks. These findings highlight both the necessity and the nuanced challenges of scene-grounded reasoning for advancing 3D intelligence.
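To make the three-level annotation scheme concrete, below is a minimal, hypothetical sketch of what a single SCoT-style record could look like. The field names (`scene_id`, `level`, `question`, `reasoning`, `answer`) and their values are illustrative assumptions for exposition, not the dataset's actual schema.

```python
# Hypothetical illustration of one SCoT-style record; field names and values
# are assumptions for exposition, not the released dataset's schema.
example_record = {
    "scene_id": "scene_0001",        # 3D scene the question is grounded in
    "level": "spatial_planning",     # one of: perception / analysis / planning
    "question": "How should a robot rearrange the chairs so the walkway stays clear?",
    "reasoning": [                   # intermediate chain-of-thought grounded in scene cues
        "Two chairs currently block the path between the table and the door.",
        "The corner beside the bookshelf is unoccupied and large enough for both chairs.",
        "Moving the chairs to that corner clears the walkway without blocking the shelf.",
    ],
    "answer": "Move both chairs to the corner beside the bookshelf.",
}

# Under the scheme described in the abstract, perception-level items would
# supervise only the answer, while analysis- and planning-level items would
# also supervise the intermediate reasoning steps.
```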
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8925