Keywords: 3D vision, diffusion, indoor scene generation
Abstract: Automatically generating complex, realistic, and interactive indoor scenes from user prompts remains a formidable challenge, requiring scalability to multi-room environments, physical plausibility, controllable editing, and minimal human intervention. Existing paradigms, such as text-to-3D synthesis and layout-based retrieval, provide complementary advantages but suffer from limited automation, structural incompleteness, suboptimal textures, and inefficiency in large-scale settings.
To overcome these limitations, we introduce \textbf{SceneLCM}, an automatic and interactive framework that integrates Large Language Model (LLM)-driven layout generation with \textit{Latent Consistency Model} (LCM)-based scene optimization. Central to SceneLCM is the proposed \textbf{Consistency Trajectory Sampling} (CTS) loss, which maintains self-consistency during LCM optimization, enabling faster convergence and higher-fidelity 3D generation with theoretically bounded distillation error.
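For readers unfamiliar with consistency-model objectives, the sketch below illustrates the self-consistency idea behind a CTS-style loss: two points on the same sampling trajectory should map to the same clean estimate. The helper names (`lcm`, `ode_solver`), the noise schedule, and the overall formulation are illustrative assumptions, not the paper's exact definition of CTS.

```python
import torch
import torch.nn.functional as F

def cts_loss_sketch(lcm, ode_solver, render, cond, t, t_prev, alphas, sigmas):
    """Minimal sketch of a self-consistency objective in the spirit of the
    proposed Consistency Trajectory Sampling (CTS) loss. `lcm` and
    `ode_solver` are assumed callables; the schedule tensors `alphas`
    and `sigmas` are indexed by integer timestep."""
    noise = torch.randn_like(render)
    # Diffuse the rendered image to timestep t on an assumed noise schedule.
    x_t = alphas[t] * render + sigmas[t] * noise
    with torch.no_grad():
        # Take one ODE step toward the earlier timestep t_prev along the same
        # trajectory, then evaluate the consistency model there as the target.
        x_prev = ode_solver(x_t, t, t_prev, cond)
        target = lcm(x_prev, t_prev, cond)
    pred = lcm(x_t, t, cond)
    # Self-consistency: both trajectory points should map to the same output.
    return F.mse_loss(pred, target)
```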
Built upon CTS-guided LCMs, the SceneLCM pipeline comprises three key stages: (1) \textbf{Layout Generation} — LLM-guided 3D spatial reasoning transforms textual descriptions into parametric floorplans and object configurations, refined via iterative programmatic verification and cluster-based object orientation; (2) \textbf{Scene Object Generation} — objects are represented as 3D Gaussians and optimized with CTS for efficient, photorealistic results; (3) \textbf{Environment Optimization} — a normal-aware texture field encodes multi-resolution scene appearance, optimized with CTS along Zigzag camera trajectories to ensure geometric and texture coherence.
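The skeleton below restates these three stages as a minimal control flow. All callables and the layout structure are placeholders chosen for illustration, not the released SceneLCM API.

```python
def scenelcm_pipeline_sketch(prompt, generate_layout, optimize_object,
                             optimize_environment, assemble):
    """High-level sketch of the three-stage pipeline described above,
    assuming placeholder stage functions."""
    # 1) Layout generation: LLM-guided spatial reasoning with programmatic
    #    verification yields a parametric floorplan and object configuration.
    layout = generate_layout(prompt)
    # 2) Scene object generation: each object is optimized as 3D Gaussians
    #    under the CTS loss.
    objects = [optimize_object(spec) for spec in layout["objects"]]
    # 3) Environment optimization: a normal-aware texture field is optimized
    #    with CTS along zigzag camera trajectories.
    environment = optimize_environment(layout)
    return assemble(layout, objects, environment)
```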
Extensive experiments demonstrate that SceneLCM produces high-quality, diverse, and physically coherent single- and multi-room scenes, while supporting texture editing and physically plausible modifications. Ablation studies validate the critical role of CTS in enabling high-quality, rapid generation across all scene components. The implementation will be publicly released to support reproducibility and foster further research.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11268