SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

Xiongkun Linghu; Jiangyong Huang; Ziyu Zhu; Baoxiong Jia; Siyuan Huang

Published: 26 Jan 2026, Last Modified: 11 Apr 2026

Venue: ICLR 2026 Poster

License: CC BY 4.0

Keywords: 3D scene reasoning, chain-of-thought reasoning, multimodal LLM

TL;DR: A step-by-step reasoning framework for 3D scene understanding

Abstract: Existing research on 3D LLMs still struggles to achieve efficient and explainable reasoning, primarily because the mechanism of human-like, scene-object grounded reasoning remains under-explored. This paper bridges the gap with a new framework. We first introduce a Chain-of-Thought reasoning framework for 3D scenes (SceneCOT), which decouples a complex reasoning task into simpler, manageable sub-problems and builds corresponding visual clues using multimodal expert modules. To enable this framework, we build the first large-scale 3D scene Chain-of-Thought reasoning dataset, SceneCOT, comprising more than 190k high-quality data instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our framework achieves state-of-the-art performance with clear interpretability. To our knowledge, this is the first successful application of the COT technique to human-like step-by-step reasoning for 3D scene understanding, and we show its potential for extension to a wider range of 3D scene understanding scenarios.

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 18953
