Unleashing Chain-of-Thought Reasoning for 3D Scene Synthesis

Mingyu Wu; Mingsheng Li; zhongyuan liu; Peng Guo; Renqiu Xia; Wenzheng Wu; Jiayuan Fan

19 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0

Keywords: 3D Scene Synthesis; Large Language Models; Chain-of-thought Reasoning

Abstract: Recently, 3D Scene Synthesis (3DSS) has attracted growing interest for its applications in autonomous intelligent systems. However, conventional methods rely heavily on manual effort and expert knowledge, and the scenes they produce are often criticized as limited in quantity and diversity. Meanwhile, large language models (LLMs) have achieved remarkable performance across a wide range of tasks, making automatic 3DSS from textual conditions feasible. Yet, lacking spatial reasoning capabilities, LLMs still struggle to generate coherent 3D scenes, often producing inappropriate objects, incorrect arrangements, and spatial conflicts. In this paper, for the first time, we explore the potential of chain-of-thought (CoT) reasoning in 3DSS and propose an approach to enhance the spatial reasoning and generation capabilities of LLMs. Specifically, we introduce a cascading Scene-CoT generation pipeline that decomposes complex design tasks into manageable subtasks through hierarchical agents, complemented by an iterative spatial optimization strategy that resolves the conflicting constraints LLMs commonly encounter. Through the interaction between semantic reasoning agents and spatial constraint optimization modules, our approach generates 3D scenes accompanied by reasoning traces and computational processes. Furthermore, we propose a two-stage progressive training framework that first distills a base model using Scene-CoT to acquire initial scene generation capabilities, and then improves the coherence and plausibility of the generated 3D scenes through reinforcement learning. Extensive experiments demonstrate the effectiveness of our Scene-CoT dataset and model in enabling high-quality automatic 3D scene synthesis.
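
The abstract does not specify how the iterative spatial optimization works. As a rough illustration of what resolving spatial conflicts can look like, the sketch below repeatedly pushes overlapping furniture footprints apart inside a rectangular room. This is not the authors' algorithm: the 2D axis-aligned box representation, the push-apart heuristic, and every name (`Box`, `overlap`, `resolve_conflicts`) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned floor footprint of one object (units: meters)."""
    name: str
    x: float  # center along the room's width
    y: float  # center along the room's depth
    w: float  # extent along x
    d: float  # extent along y


def overlap(a: Box, b: Box) -> tuple[float, float]:
    """Penetration depth along x and y; both positive means the footprints collide."""
    ox = (a.w + b.w) / 2 - abs(a.x - b.x)
    oy = (a.d + b.d) / 2 - abs(a.y - b.y)
    return ox, oy


def clamp_to_room(box: Box, room_w: float, room_d: float) -> None:
    """Keep the whole footprint inside the room [0, room_w] x [0, room_d]."""
    box.x = min(max(box.x, box.w / 2), room_w - box.w / 2)
    box.y = min(max(box.y, box.d / 2), room_d - box.d / 2)


def resolve_conflicts(boxes: list[Box], room_w: float, room_d: float,
                      max_iters: int = 100, eps: float = 1e-6) -> bool:
    """Iteratively separate colliding boxes; True if a pass finds no collision."""
    for _ in range(max_iters):
        moved = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                ox, oy = overlap(a, b)
                if ox > eps and oy > eps:
                    moved = True
                    # Separate along the axis with the smaller penetration,
                    # moving each box half of the required distance.
                    if ox <= oy:
                        sign = 1.0 if a.x <= b.x else -1.0
                        a.x -= sign * (ox / 2 + eps)
                        b.x += sign * (ox / 2 + eps)
                    else:
                        sign = 1.0 if a.y <= b.y else -1.0
                        a.y -= sign * (oy / 2 + eps)
                        b.y += sign * (oy / 2 + eps)
                    clamp_to_room(a, room_w, room_d)
                    clamp_to_room(b, room_w, room_d)
        if not moved:
            return True
    return False  # gave up: the layout may be over-constrained


if __name__ == "__main__":
    layout = [Box("bed", 1.5, 1.0, 2.0, 2.0),
              Box("nightstand", 2.0, 1.0, 0.5, 0.5)]
    print(resolve_conflicts(layout, room_w=5.0, room_d=4.0), layout)
```

Because clamping to the walls can reintroduce an overlap, the loop re-checks every pair until a full pass makes no move or the iteration budget runs out. A `False` result signals a layout that this heuristic could not fix.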

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 19064
