CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: multi-view diffusion, novel view synthesis, autoregressive generation, world model
TL;DR: CausNVS is an autoregressive multi-view diffusion model for novel view synthesis, with pairwise-relative camera pose encodings (CaPE) in attention and efficient KV-cache inference, towards real-time world modeling, AR streaming, and interactive online generation.
Abstract: Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and suffer from slow inference due to denoising all frames simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially aware sliding window with key-value caching and noise-conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings.
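The inference procedure described in the abstract (sequential per-frame denoising conditioned on a sliding window of cached keys/values, with noise-conditioning augmentation applied to the cached history) can be sketched roughly as below. This is a minimal illustration under assumed interfaces, not the authors' implementation: the names `SlidingKVCache`, `denoise_frame`, and `generate`, as well as all shapes and hyperparameters, are hypothetical, and the denoiser is a stub.

```python
# Minimal sketch (hypothetical names, stub denoiser) of autoregressive view
# generation with a sliding-window KV cache and noise-conditioning augmentation.
from collections import deque
import torch


class SlidingKVCache:
    """Keeps keys/values of only the most recent `window` generated frames."""

    def __init__(self, window: int):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)

    def context(self):
        # Concatenate cached frames along the token dimension.
        if not self.keys:
            return None, None
        return torch.cat(list(self.keys), dim=1), torch.cat(list(self.values), dim=1)


def denoise_frame(noisy, ctx_k, ctx_v, pose):
    """Stand-in for a full per-frame denoising loop of the diffusion model.

    A real model would attend to (ctx_k, ctx_v) using relative camera pose
    encodings for the target `pose`; here we just return a placeholder frame.
    """
    return noisy * 0.0


@torch.no_grad()
def generate(poses, window=4, cond_noise_std=0.05, frame_shape=(3, 64, 64)):
    cache = SlidingKVCache(window)
    frames = []
    for pose in poses:
        ctx_k, ctx_v = cache.context()
        noisy = torch.randn(1, *frame_shape)           # start each frame from pure noise
        frame = denoise_frame(noisy, ctx_k, ctx_v, pose)
        frames.append(frame)
        # Noise-conditioning augmentation: lightly corrupt the frame before
        # caching so later frames are conditioned on slightly noised history,
        # which helps mitigate drift over long rollouts.
        cond = frame + cond_noise_std * torch.randn_like(frame)
        k = v = cond.flatten(1).unsqueeze(1)           # dummy per-frame K/V projection
        cache.append(k, v)
    return frames


frames = generate(poses=range(8))
print(len(frames), frames[0].shape)
```

Because only the last `window` frames remain in the cache, memory and attention cost stay bounded regardless of rollout length, which is what makes streaming or interactive generation practical in this setting.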
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11830