Keywords: 3D scene generation, training-free, overlapping patch-wise flow
Abstract: In this paper, we propose Extend3D, a novel training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitation that object-centric models have fixed-size latent spaces and thus cannot represent wide scenes, we extend the latent space by factors of $(a, b)$ in the $x$ and $y$ directions. We then divide the extended latent into overlapping patches, apply the object-centric model to each patch, and couple the patches at every timestep. In addition, since object-centric models are poor at sub-scene generation, we use the input image and a point cloud extracted by a depth estimator as priors to guide this process. Using the point cloud prior, we initialize the structure of the scene and refine occluded regions with iterative under-noised SDEdit. Both priors are also used to optimize the extended latent during denoising so that the denoising paths do not deviate from the sub-scene dynamics. We demonstrate that our method produces better results than previous methods in human preference evaluations. An ablation study shows that each component of Extend3D plays a crucial role in training-free 3D scene generation.
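For intuition, below is a minimal sketch of one way the overlapping patch-wise coupling could work: at each timestep, every patch of the extended latent is denoised with the fixed-size model, and overlapping predictions are averaged back together (a MultiDiffusion-style coupling rule). All names here (`denoise_patch`, `coupled_denoise_step`) are hypothetical; the paper's actual coupling mechanism may differ.

```python
import torch

def coupled_denoise_step(latent, t, patch_size, stride, denoise_patch):
    """One coupled denoising step over an extended latent of shape (C, H, W).

    `denoise_patch(patch, t)` is assumed to be the fixed-size object-centric
    model's single-step prediction for a patch of shape (C, p, p). The stride
    should tile the latent evenly so every location is covered by a patch.
    """
    C, H, W = latent.shape
    out = torch.zeros_like(latent)
    weight = torch.zeros(1, H, W)
    p = patch_size
    for y in range(0, H - p + 1, stride):
        for x in range(0, W - p + 1, stride):
            patch = latent[:, y:y + p, x:x + p]
            pred = denoise_patch(patch, t)       # fixed-size model call per patch
            out[:, y:y + p, x:x + p] += pred     # accumulate overlapping predictions
            weight[:, y:y + p, x:x + p] += 1.0   # count how many patches cover each cell
    return out / weight.clamp(min=1.0)           # average where patches overlap
```

Averaging in overlap regions is one simple way to keep neighboring patches consistent with each other at every timestep; it is presented here only as an illustration of the general idea, not as the authors' exact method.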
Supplementary Material: zip
Primary Area: generative models
Submission Number: 11309