Keywords: novel view synthesis, video diffusion model, 3D generation
Abstract: The development of multi-view image synthesis is constrained by the scarcity of training data. One promising solution is to finetune well-trained video generative models to synthesize 360-degree videos of objects. While these methods benefit from the strong generative priors inherited from pretraining, they are limited by the high computational cost incurred by the large number of viewpoints. Existing methods commonly adopt temporal attention mechanisms to address this issue, but they suffer from undesirable artifacts such as 3D inconsistency and over-smoothing in the generated results. In this paper, we introduce a novel approach that unlocks video priors for multi-view synthesis by reducing generation to a sparser yet more precise process. Specifically, we introduce two strategies to achieve this: i) condensing the video diffusion model to synthesize highly consistent sparse multi-view images;
ii) extracting dense geometric priors from the pretrained video diffusion model to enhance generation stability. Together, these two strategies form a novel framework for multi-view synthesis that produces highly consistent sparse multi-view images with strong generalization ability. Extensive experiments demonstrate that our approach achieves superior efficiency, generalization, and consistency, outperforming state-of-the-art multi-view synthesis methods.
Supplementary Material: pdf
Submission Number: 286