Abstract: Recent progress in novel view synthesis of indoor scenes using diffusion models has attracted significant attention, particularly for generating views at desired poses from a single source image. Existing methods can produce plausible views near the input view, but they often fail to extrapolate views far beyond the input perspective. Moreover, achieving a multiview-consistent diffusion model typically requires training computationally intensive 3D priors, limiting scalability to long-range generation. In this paper, we present a transformer-based latent diffusion model that leverages view-geometry constraints: feature maps of the input view explicitly warped to the target view to serve as the denoising target, and conditioning on a combination of an epipolar-weighted source-image feature map, a Plücker raymap, and camera poses. This approach enables the extrapolation of semantically and geometrically consistent novel views over long-range trajectories in a single-shot manner. We evaluate our model on two indoor datasets, ScanNet and RealEstate10K, using a diverse set of metrics for view quality and consistency. Experimental results demonstrate the superiority of our approach over existing models, showcasing its potential for semantically and geometrically consistent novel view synthesis that scales to video generation applications.
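As a concrete illustration of one of the conditioning signals mentioned above, the sketch below shows one common way to build a per-pixel Plücker raymap from camera intrinsics and a camera-to-world pose; it is not the authors' implementation, and the function name and tensor layout are assumptions for illustration only.

```python
import numpy as np

def plucker_raymap(K, c2w, H, W):
    """Hypothetical helper: per-pixel Plücker raymap with 6 channels (direction, moment).

    K    : (3, 3) camera intrinsics
    c2w  : (4, 4) camera-to-world extrinsics
    H, W : image height and width
    Returns an (H, W, 6) array of unit ray directions d and moments o x d.
    """
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)

    # Back-project pixels to camera-space ray directions, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T                         # (H, W, 3)
    dirs_world = dirs_cam @ c2w[:3, :3].T                       # (H, W, 3)
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Plücker coordinates: (d, o x d), where o is the camera center in world space.
    origin = c2w[:3, 3]                                         # (3,)
    moment = np.cross(np.broadcast_to(origin, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)        # (H, W, 6)
```

Because the moment o x d is invariant to the choice of point along the ray, this 6-channel map gives the diffusion model a pose-aware, per-pixel encoding of target-view geometry that can be concatenated with other conditioning features.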
External IDs: dblp:conf/ijcnn/KangXZK25