Keywords: novel view synthesis, video diffusion model, 3D generation
Abstract: The development of multi-view image synthesis is constrained by the scarcity of training data. One promising solution is to finetune well-trained video generative models to synthesize 360-degree videos of objects. While these methods benefit from the strong generative priors inherited from pretraining, they are limited by the high computational cost incurred by the large number of viewpoints. Existing methods commonly adopt temporal attention mechanisms to address this issue, but they suffer from undesirable artifacts such as 3D inconsistency and over-smoothing in the generated results. In this paper, we introduce a novel approach that unlocks video priors for multi-view synthesis by reducing generation to a sparser yet more precise process. Specifically, we introduce two strategies to achieve this: i) condensing the video diffusion model to synthesize highly consistent sparse multi-view images;
ii) extracting dense geometric priors from the pretrained video diffusion model to enhance generation stability. Together, these two strategies form a novel framework for multi-view synthesis that produces highly consistent sparse multi-view images with strong generalization ability. Extensive experiments demonstrate that our approach achieves superior efficiency, generalization, and consistency, outperforming state-of-the-art multi-view synthesis methods.
Supplementary Material: pdf
Submission Number: 286