Multi-view Geometry-Aware Diffusion Transformer for Indoor Novel View Synthesis

Published: 06 Mar 2025, Last Modified: 14 Apr 2025
Venue: ICLR 2025 DeLTa Workshop Poster
License: CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: Multi-view, Geometry, Diffusion Transformer, Inpainting, Epipolar Attention, Plücker Raymap, Novel View Synthesis
TL;DR: This paper presents a transformer-based latent diffusion model for novel view synthesis, leveraging geometric constraints to enable long-range, single-shot view extrapolation with improved consistency and scalability.
Abstract: Novel view synthesis for indoor scenes using diffusion models has gained significant attention, particularly for generating target poses from a single source image. While existing methods produce plausible nearby views, they struggle to extrapolate perspectives far beyond the input. Moreover, achieving multi-view consistency typically requires computationally expensive 3D priors, limiting scalability for long-range generation. In this paper, we propose a transformer-based latent diffusion model that integrates view-geometry constraints to enable long-range, consistent novel view synthesis. Our approach explicitly warps input-view feature maps to the target viewpoint as an estimate of the denoised target view, and conditions generation on a combination of epipolar-weighted source-image features, Plücker raymaps, and camera poses. This design allows semantically and geometrically coherent extrapolation of novel views in a single shot. We evaluate our model on the ScanNet and RealEstate10K datasets using diverse metrics for view quality and consistency. Experimental results demonstrate its superiority over existing methods, highlighting its potential for scalable, high-fidelity novel view synthesis in video generation.
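The Plücker raymap conditioning mentioned in the abstract assigns each pixel the Plücker coordinates of its viewing ray, i.e. the pair (d, o × d) of the unit ray direction d and its moment about the origin, computed from the camera intrinsics and pose. The paper does not specify its implementation; the sketch below is a minimal illustration assuming a standard pinhole intrinsics matrix `K` and a 4×4 camera-to-world pose `c2w` (both names are hypothetical):

```python
import numpy as np

def plucker_raymap(K, c2w, H, W):
    """Per-pixel Plücker ray coordinates (d, o x d) as an (H, W, 6) map."""
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)               # (H, W, 3)
    # Unproject to camera-space ray directions, then rotate to world space.
    dirs = (pix @ np.linalg.inv(K).T) @ c2w[:3, :3].T              # (H, W, 3)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)           # unit d
    origin = c2w[:3, 3]                                            # camera center o
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)   # m = o x d
    return np.concatenate([dirs, moment], axis=-1)                 # (H, W, 6)
```

The resulting 6-channel map uniquely identifies each ray independently of parameterization, which makes it a convenient dense pose encoding to concatenate with image features when conditioning the diffusion transformer.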
Submission Number: 56