Keywords: Agent, 3D Scene Generation, Multimodal Large Language Models
TL;DR: We propose an agent-based, sequential object placement framework for 3D indoor scene generation that supports incremental editing and uses a solver fusing semantic and visual cues to produce more plausible, consistent, and realistic layouts.
Abstract: Interactive 3D indoor scene generation is a crucial task with applications in embodied AI, virtual reality, and physics-based simulation. To enable generated scenes to be directly imported into off-the-shelf 3D engines, most prior work follows a retrieve-then-place pipeline. These systems typically combine large language models with traditional procedural content generation pipelines. While effective for one-shot generation of complete scenes, they lack incremental editability: inserting a new object often triggers global re-optimization, and localized re-layout is not natively supported. Moreover, most methods produce a semantic scene graph via an LLM, ignoring visual cues that naturally encode spatial relations. In this paper, we present an agent-based approach to scene layout generation that places objects sequentially. Conditioned on user instructions, we first retrieve relevant 3D assets, then iteratively select an object, predict its position and orientation, and place it in the scene. Each decision is conditioned on the current scene state, enabling flexible placement and incremental editing, including object insertion and local rearrangement. We further introduce a layout solver that fuses semantic scene-graph constraints with visual cues, substantially improving spatial plausibility and global consistency. Extensive experiments show that our method achieves superior layout aesthetics and functional realism.
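The abstract describes a sequential, scene-state-conditioned placement loop with a solver that fuses semantic and visual scores. The following is a minimal sketch of that control flow under stated assumptions; all function names, the candidate-sampling strategy, and the 0.5/0.5 fusion weights are illustrative placeholders, not the paper's actual implementation or API.

```python
# Hypothetical sketch of the sequential placement loop described in the abstract.
# retrieve_assets, propose_poses, semantic_score, and visual_score are stand-ins
# for the paper's retrieval, agent proposal, and layout-solver components.

from dataclasses import dataclass, field
import random


@dataclass
class Pose:
    x: float
    y: float
    theta: float  # orientation in radians


@dataclass
class Scene:
    placed: list = field(default_factory=list)  # (asset, Pose) pairs


def retrieve_assets(instruction):
    """Stand-in for instruction-conditioned 3D asset retrieval."""
    return ["bed", "nightstand", "lamp"]


def propose_poses(asset, scene, n=16):
    """Sample candidate poses; a real system would query the placement agent."""
    return [Pose(random.uniform(0, 4), random.uniform(0, 4),
                 random.uniform(0, 6.283)) for _ in range(n)]


def semantic_score(asset, pose, scene):
    """Placeholder for scene-graph constraint satisfaction (e.g. 'lamp on nightstand')."""
    return random.random()


def visual_score(asset, pose, scene):
    """Placeholder for visual-cue plausibility, e.g. from rendered views of the scene."""
    return random.random()


def place_sequentially(instruction, w_sem=0.5, w_vis=0.5):
    """Place retrieved assets one at a time, conditioning each choice on the current scene."""
    scene = Scene()
    for asset in retrieve_assets(instruction):
        candidates = propose_poses(asset, scene)
        best = max(candidates,
                   key=lambda p: w_sem * semantic_score(asset, p, scene)
                                 + w_vis * visual_score(asset, p, scene))
        scene.placed.append((asset, best))
    return scene


if __name__ == "__main__":
    layout = place_sequentially("a cozy bedroom")
    for asset, pose in layout.placed:
        print(f"{asset}: ({pose.x:.2f}, {pose.y:.2f}, {pose.theta:.2f} rad)")
```

Because each object is placed against the current scene state rather than through a global re-optimization, the same loop can, in principle, be re-entered to insert a new object or locally rearrange an existing one, which is the incremental-editing property the abstract emphasizes.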
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16924