Synthesis of Interactive and Expansive Apartment Environments

Published: 09 May 2026 · Last Modified: 09 May 2026 · Venue: MUSI · License: CC BY 4.0
Keywords: 3D Scene, Diffusion model, Computer vision
Abstract: Synthesizing interactive environments at the scale of residential apartments provides a necessary foundation for advancing multimodal spatial intelligence. Current methods cannot create indoor spaces this expansive while maintaining the functional complexity and physical realism required for training multimodal large language models. We address this gap with a natural-language-guided generative framework that couples the reasoning capabilities of language models with diffusion-based posterior sampling. The language model performs spatial reasoning to determine the layout configuration before the floor plan is instantiated in the scene. Differentiable constraint functions ensure that these complex layouts respect physical boundaries and that the moving parts of different furniture pieces do not intersect. Evaluations across various apartment configurations show that the approach produces environments with high semantic consistency and operational utility. The generated worlds supply diverse training data for tasks requiring spatial reasoning and the manipulation of articulated objects. Achieving synthesis at the scale of a full apartment distinguishes this work from prior attempts and establishes a practical foundation for research in multimodal understanding and embodied intelligence. With this framework, researchers can produce diverse, functionally interactive 3D worlds tailored to the needs of multimodal agents.
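The abstract describes steering diffusion posterior sampling with differentiable constraint functions so that sampled layouts respect boundaries and avoid furniture intersections. A minimal sketch of that general idea in NumPy, not the authors' implementation: all function names are hypothetical, the denoiser is replaced by an identity map, and the gradient is approximated by finite differences rather than autodiff.

```python
import numpy as np

def overlap_penalty(boxes):
    """Differentiable-in-spirit pairwise overlap area for axis-aligned
    furniture footprints; boxes is an (N, 4) array of [x, y, w, h]."""
    total = 0.0
    n = len(boxes)
    for i in range(n):
        for j in range(i + 1, n):
            xi, yi, wi, hi = boxes[i]
            xj, yj, wj, hj = boxes[j]
            ox = max(0.0, min(xi + wi, xj + wj) - max(xi, xj))
            oy = max(0.0, min(yi + hi, yj + hj) - max(yi, yj))
            total += ox * oy
    return total

def penalty_grad(boxes, eps=1e-4):
    """Finite-difference gradient of the penalty (a stand-in for the
    autodiff gradient a real differentiable constraint would provide)."""
    g = np.zeros_like(boxes)
    base = overlap_penalty(boxes)
    for idx in np.ndindex(boxes.shape):
        b = boxes.copy()
        b[idx] += eps
        g[idx] = (overlap_penalty(b) - base) / eps
    return g

def guided_step(boxes, noise_scale, guidance=0.2, rng=None):
    """One illustrative reverse-diffusion update: an (identity) denoising
    step, a gradient step away from overlaps, then fresh noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    boxes = boxes - guidance * penalty_grad(boxes)
    return boxes + noise_scale * rng.normal(size=boxes.shape)
```

Run with a decaying noise schedule, the guidance term pushes overlapping footprints apart while the sampler continues to explore; with `noise_scale=0.0` the update reduces to plain gradient descent on the penalty.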
Supplementary Material: pdf
Previously Accepted: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 8