Keywords: video diffusion, flow matching, 3d-reconstruction, 3d-generation
TL;DR: Video diffusion samples 3D by learning to improve renderings of the current 3D reconstruction.
Abstract: Reconstructing three-dimensional (3D) representations from sparse image data is a core task that requires learning to sample plausible 3D models consistent with the 2D conditioning images. Despite numerous proposed frameworks, achieving photorealistic sparse-view 3D reconstruction remains an unresolved challenge, with current methods often producing blurry results on small object-centric scenes that fall short of the fidelity achieved by dense-view 3D reconstruction and 2D generative models. This paper rethinks how to best adapt image generative models for 3D reconstruction and introduces a novel framework for this task. Our approach infers the 3D representation by optimizing it to match images sampled by a 2D generative model, itself conditioned on the current progress of the 3D optimization. To learn this conditional generative model, we design a new training strategy that performs 3D reconstruction using varying numbers of views and captures the progress at each optimization timestep. This allows the model to explicitly learn to sample images that are consistent with the current stage of 3D reconstruction, supporting the sampling of thousands of consistent images during reconstruction. Our approach decouples the 3D representation from learning a generative model, and can thus be integrated as a plug-and-play component into existing 3D reconstruction pipelines, such as Gaussian Splatting. Experiments on a challenging real-world dataset demonstrate competitive performance in single-view 3D reconstruction, performing on par with state-of-the-art 3D reconstruction methods based on 2D generative model outputs and dense multiview images.
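The loop sketched below illustrates the idea described in the abstract, not the authors' actual implementation: a 3D representation is optimized to match images sampled by a 2D generative model that is conditioned on renderings of the current, partially optimized 3D state. All names (GaussianScene, render_current_3d, sample_consistent_images) are hypothetical placeholders.

```python
# Minimal sketch of the reconstruction loop, assuming a differentiable renderer
# and a conditional 2D generative model; both are stubbed out here.
import torch


class GaussianScene(torch.nn.Module):
    """Placeholder 3D representation (stands in for a Gaussian Splatting scene)."""
    def __init__(self, num_points: int = 1024):
        super().__init__()
        # Per-point parameters (e.g. position, scale, rotation, color, opacity).
        self.params = torch.nn.Parameter(torch.randn(num_points, 14))


def render_current_3d(scene: GaussianScene, cameras: torch.Tensor) -> torch.Tensor:
    """Placeholder differentiable renderer: one HxWx3 image per camera."""
    b = cameras.shape[0]
    return torch.sigmoid(scene.params.mean()) * torch.ones(b, 64, 64, 3)


@torch.no_grad()
def sample_consistent_images(renders: torch.Tensor, cond_views: torch.Tensor) -> torch.Tensor:
    """Placeholder for the conditional generative model: given the current renders
    (reconstruction progress) and the sparse conditioning views, it samples target
    images consistent with that stage of the optimization."""
    return 0.5 * renders + 0.5 * cond_views.mean(dim=0, keepdim=True)


def reconstruct(cond_views: torch.Tensor, cameras: torch.Tensor, steps: int = 100) -> GaussianScene:
    scene = GaussianScene()
    opt = torch.optim.Adam(scene.parameters(), lr=1e-2)
    for _ in range(steps):
        renders = render_current_3d(scene, cameras)                 # current 3D progress
        targets = sample_consistent_images(renders, cond_views)     # generative guidance
        loss = torch.nn.functional.mse_loss(renders, targets)       # fit 3D to sampled images
        opt.zero_grad()
        loss.backward()
        opt.step()
    return scene
```

Because the generative model only supplies target images, the 3D representation and its optimizer can be swapped for any existing pipeline (e.g. Gaussian Splatting), which is the plug-and-play property claimed in the abstract.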
Submission Number: 2