ChatRearrange: Learning Text-guided 3D Scene Rearrangement

ICLR 2026 Conference Submission18488 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: 3D Vision, LLM, Chain-of-Thought
TL;DR: We propose ChatRearrange, a novel framework for rearranging a messy room based on a text description.
Abstract: In this paper, we propose ChatRearrange, a novel framework for the text-guided 3D scene rearrangement task. Instructing an algorithm to rearrange 3D furniture objects from a text description remains an open problem in 3D vision, and developing algorithms for it presents two critical challenges. First, there is no appropriate text-labeled scene data for training. Second, evaluating performance is difficult due to the absence of suitable benchmarks. To address the first challenge, the ChatRearrange framework includes an LLM-based Inverse Distillation algorithm, enabling training without description-labeled scene data; we additionally incorporate a novel gradient-field-based student network that learns text-3D knowledge from the LLM. For the second challenge, we benchmark the text-guided 3D scene rearrangement task with a new dataset, TextRoom, together with a suite of evaluation metrics. The results show that our algorithm outperforms other baselines by a large margin. We are committed to releasing all code and the dataset if the paper is accepted.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18488