Keywords: text-to-3D, LLM, scene understanding, 3D vision
TL;DR: Turns free-form text into fully textured, navigable multi-floor 3D homes in ≲10 s/floor via an agentic, training-free pipeline with LLM planning and fast depth-conditioned inpainting
Abstract: We introduce SwiftHome, the first system that transforms free-form natural-language descriptions into fully textured, navigable multi-floor 3D houses in under ten seconds per floor. Starting from a large-language-model (LLM) parse of the input text, SwiftHome assembles a hierarchical scene graph, lays out rooms across multiple stories, retrieves or generates furniture meshes, and applies style-consistent materials, all in a single forward pass. A lightweight multi-agent feedback loop couples an LLM “planner” with a rule-based “validator,” eliminating object collisions and enforcing ergonomic spacing without resorting to time-consuming diffusion optimization. Key viewpoints are then textured via a depth-conditioned inpainting module, yielding coherent, high-fidelity appearances while preserving real-time performance. SwiftHome achieves a near-zero out-of-bounds placement rate (3%), high text-scene alignment (CLIP score of 30.5), and style-consistent textures, outperforming previous pipelines by two orders of magnitude in speed. An interactive interface lets users iteratively refine layouts by mixing text edits with direct object manipulation, making SwiftHome a practical tool for game design, VR/AR prototyping, and rapid architectural visualization.
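The planner-validator feedback loop is the abstract's core mechanism for collision-free, ergonomically spaced layouts. Below is a minimal Python sketch of how such a loop could be wired; all names (`Placement`, `llm_propose`) and the axis-aligned clearance test are illustrative assumptions, not SwiftHome's actual implementation.

```python
# Minimal sketch of an LLM-planner / rule-based-validator loop.
# Hypothetical names throughout; not SwiftHome's real API.
from dataclasses import dataclass

@dataclass
class Placement:
    name: str
    x: float  # center position (meters), room centered at origin
    y: float
    w: float  # footprint width/depth (meters)
    d: float

def overlaps(a: Placement, b: Placement, clearance: float = 0.5) -> bool:
    """Axis-aligned overlap test with an ergonomic clearance margin."""
    return (abs(a.x - b.x) < (a.w + b.w) / 2 + clearance and
            abs(a.y - b.y) < (a.d + b.d) / 2 + clearance)

def validate(layout, room_w, room_d):
    """Rule-based validator: collect out-of-bounds and collision violations."""
    issues = []
    for i, p in enumerate(layout):
        if abs(p.x) + p.w / 2 > room_w / 2 or abs(p.y) + p.d / 2 > room_d / 2:
            issues.append(f"{p.name} is out of bounds")
        for q in layout[i + 1:]:
            if overlaps(p, q):
                issues.append(f"{p.name} collides with {q.name}")
    return issues

def plan_with_feedback(llm_propose, room_w, room_d, max_rounds=3):
    """Re-prompt the LLM planner with the validator's violation list
    until the layout passes or the round budget runs out."""
    feedback = []
    for _ in range(max_rounds):
        layout = llm_propose(feedback)  # planner returns a list of Placements
        feedback = validate(layout, room_w, room_d)
        if not feedback:
            return layout
    return layout  # best effort after max_rounds
```

Because the validator is cheap rule-based geometry rather than diffusion-based optimization, each feedback round costs only one LLM call, which is consistent with the claimed sub-ten-second-per-floor budget.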
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4180