SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Published: 05 Nov 2025, Last Modified: 30 Jan 2026, Venue: 3DV 2026 Poster, License: CC BY 4.0
Keywords: 3D asset generation, 3D scene generation
Abstract: 3D content generation has recently attracted significant research interest, driven by its critical applications in VR/AR and embodied AI. In this work, we tackle the challenging task of **synthesizing multiple 3D assets within a single scene image**. Concretely, our contributions are fourfold: (i) we present **SceneGen**, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for extra optimization or asset retrieval; (ii) we introduce a novel **feature aggregation** module that integrates local and global scene information from visual and geometric encoders within the **feature extraction** module. Coupled with a **position head**, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architecture yields improved generation performance when multiple images are provided; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robustness of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available.
Supplementary Material: pdf
Submission Number: 285