SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

Jeongjun Choi; Yeonsoo Park; H. Jin Kim

20 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Indoor scene synthesis, Generative models, Non-autoregressive transformers, Conditional Generation

TL;DR: We propose SceneNAT, a masked generative model for synthesizing 3D indoor scenes conditioned on natural language instructions.

Abstract: We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
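The two-level masking mentioned in the abstract can be illustrated with a minimal sketch. It assumes a scene is stored as an integer array of shape (num_objects, num_attrs) holding discretized attribute tokens. The name `mask_scene`, the `MASK_ID` value, and the ratio parameters are hypothetical choices made for illustration and are not taken from the paper, whose actual masking schedule may differ.

```python
import numpy as np

MASK_ID = -1  # hypothetical id reserved for the [MASK] token (assumption)


def mask_scene(tokens, attr_ratio, inst_ratio, rng):
    """Return (masked_tokens, mask) for a (num_objects, num_attrs) integer array."""
    num_objects, num_attrs = tokens.shape
    # Attribute-level masking: hide individual attribute tokens independently.
    mask = rng.random((num_objects, num_attrs)) < attr_ratio
    # Instance-level masking: hide every attribute of a random subset of objects.
    num_hidden = int(round(inst_ratio * num_objects))
    hidden_rows = rng.choice(num_objects, size=num_hidden, replace=False)
    mask[hidden_rows, :] = True
    # Masked positions are replaced by MASK_ID; all others keep their token.
    masked = np.where(mask, MASK_ID, tokens)
    return masked, mask


rng = np.random.default_rng(0)
scene = rng.integers(0, 32, size=(4, 5))  # 4 objects, 5 discretized attributes each
masked, mask = mask_scene(scene, attr_ratio=0.3, inst_ratio=0.5, rng=rng)
```

In a training loop of this kind, the model would be asked to predict the original tokens at the positions where `mask` is true. Fully masked rows exercise inter-object structure, while partially masked rows exercise intra-object structure.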

Supplementary Material: zip

Primary Area: generative models

Submission Number: 24277
