3D-Scene-Former: 3D scene generation from a single RGB image using Transformers

Published: 2025 · Last Modified: 02 Mar 2026 · Vis. Comput. 2025 · CC BY-SA 4.0
Abstract: 3D scene generation typically requires complex hardware setups, such as multiple cameras and depth sensors. To address this challenge, there is a need for generating 3D scenes from a single RGB image by understanding the spatio-contextual information inside a scene. However, generating 3D scenes from a single RGB image is a formidable undertaking because the depth information is missing. Moreover, the scene must be generated from various angles and positions, which necessitates extrapolation from the limited information in a single image. Current state-of-the-art techniques hinge on extracting global and local features from the 2D scene and employ a combined estimation strategy to tackle this challenge. However, existing approaches still struggle to accurately estimate 3D parameters, especially due to the strong occlusions in cluttered environments. In this paper, we propose 3D-Scene-Former, a novel solution that generates 3D indoor scenes from a single RGB image and refines the initial estimations using a Transformer network. We evaluated our approach on two well-known datasets, benchmarking it against state-of-the-art solutions. Our method outperforms the state of the art in 3D object detection and 3D pose estimation by a margin of 11.37%. 3D-Scene-Former opens new avenues for 3D content creation, transforming a single RGB image into realistic 3D scenes through the use of interconnected mesh structures.