Keywords: scene generation, depth inpainting
Abstract: 3D scene generation has rapidly become a challenging new research direction, fueled by steady improvements in 2D generative diffusion models. Current methods generate scenes by iteratively stitching newly generated images with existing geometry, using pre-trained monocular depth estimators to lift the generated images to 3D. The predicted depth is then fused with the existing scene representation through various alignment operations. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal, as it ignores the geometry of the existing scene, thus prompting the need for alignment. We instead introduce a depth completion model that directly learns the 3D fusion process, resulting in improved geometric coherence of generated scenes. Second, we introduce a new benchmark to evaluate the geometric accuracy of scene generation methods. We show that the commonly used CLIP score between scene prompts and images is unsuitable for measuring the geometric quality of a scene, and we introduce a depth-based metric instead. Our benchmark thus offers an additional dimension along which to gauge the quality of generated scenes.
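The abstract's core distinction can be made concrete with a short sketch. The following Python snippet (not the authors' code; `monocular_depth_model`, `depth_completion_model`, `render_depth`, and `align_to_scene` are hypothetical stand-ins) contrasts the baseline pipeline, where monocular depth is predicted independently and must be aligned to the scene afterwards, with the proposed direction, where a depth completion model is conditioned on the depth rendered from the existing scene and is therefore consistent with it by construction.

```python
# Minimal sketch, assuming a pinhole camera model; not the authors' implementation.
import numpy as np

def unproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift an HxW depth map to an (H*W, 3) point cloud using intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixel coords
    rays = pixels @ np.linalg.inv(K).T      # back-project pixels to camera rays
    return rays * depth.reshape(-1, 1)      # scale each ray by its depth

# Baseline: monocular depth ignores existing geometry, so the lifted points
# must be aligned to the scene after the fact (e.g. scale/shift fitting).
#   depth_mono = monocular_depth_model(rgb)                   # hypothetical
#   points = align_to_scene(unproject(depth_mono, K), scene)  # hypothetical
#
# Depth completion: the model sees the partial depth rendered from the
# existing scene and inpaints the missing regions, so no alignment is needed.
#   depth_partial = render_depth(scene, camera)               # hypothetical
#   depth_full = depth_completion_model(rgb, depth_partial)   # hypothetical
#   points = unproject(depth_full, K)
```

For the second contribution, the abstract does not specify the exact metric; purely as an illustration of what a depth-based metric can look like, the sketch below computes the absolute relative depth error after median scaling, a common choice in depth evaluation. The benchmark's actual metric may differ.

```python
# Illustrative only: one common depth-based error, not necessarily the paper's metric.
import numpy as np

def abs_rel_depth_error(pred: np.ndarray, ref: np.ndarray) -> float:
    """Median-align predicted depth to a reference, then mean |pred - ref| / ref."""
    mask = ref > 0                                                  # ignore invalid reference pixels
    pred = pred[mask] * np.median(ref[mask]) / np.median(pred[mask])  # scale alignment
    return float(np.mean(np.abs(pred - ref[mask]) / ref[mask]))
```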
Supplementary Material: zip
Submission Number: 387