UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, NeurIPS 2025 poster, CC BY 4.0
Keywords: Novel View Synthesis, Autoregressive Models, Diffusion Models.
TL;DR: We propose a method that unifies deterministic feed‑forward rendering with autoregressive diffusion to synthesize photorealistic novel views from sparse inputs in a single transformer framework.
Abstract: Novel view synthesis (NVS) seeks to render photorealistic, 3D‑consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas stochastic diffusion‑based methods hallucinate plausible content yet incur heavy training‑ and inference‑time costs. In this paper, we propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi‑view image tokens and Plücker‑ray embeddings, producing a shared latent representation. Two lightweight heads then act on this representation: (i) a feed‑forward regression head that renders pixels where geometry is well constrained, and (ii) a masked autoregressive diffusion head that completes occluded or unseen regions. The entire model is trained end‑to‑end with joint photometric and diffusion losses, without handcrafted 3D inductive biases, enabling scalability across diverse scenes. Experiments demonstrate that our method attains state‑of‑the‑art image quality while reducing rendering time by an order of magnitude compared with fully generative baselines.
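To make the two-head design described in the abstract concrete, below is a minimal, hypothetical sketch in PyTorch of a shared bidirectional encoder over image tokens and Plücker-ray embeddings, feeding (i) a feed-forward regression head and (ii) a lightweight denoising head standing in for the masked autoregressive diffusion head, trained with a joint photometric + diffusion loss. All module names, dimensions, the simplified forward noising process, and the unit loss weighting are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of the hybrid regression + diffusion architecture; not the authors' code.
import torch
import torch.nn as nn

class HybridNVSModel(nn.Module):
    def __init__(self, token_dim=256, num_layers=8, num_heads=8, patch_dim=3 * 16 * 16):
        super().__init__()
        # Project image patches and 6-D Plücker ray embeddings into a shared token space.
        self.patch_proj = nn.Linear(patch_dim, token_dim)
        self.ray_proj = nn.Linear(6, token_dim)
        # Bidirectional (non-causal) transformer over all multi-view tokens.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # (i) Feed-forward regression head for well-constrained regions.
        self.regression_head = nn.Linear(token_dim, patch_dim)
        # (ii) Per-token noise predictor standing in for the masked autoregressive diffusion head.
        self.diffusion_head = nn.Sequential(
            nn.Linear(token_dim + patch_dim + 1, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, patch_dim),
        )

    def forward(self, patches, rays, noisy_targets, timesteps):
        # patches: (B, N, patch_dim), rays: (B, N, 6), timesteps: (B,)
        tokens = self.patch_proj(patches) + self.ray_proj(rays)
        latent = self.encoder(tokens)                      # shared latent representation
        rgb_pred = self.regression_head(latent)            # deterministic photometric branch
        # Condition the denoiser on the latent, the noisy target patch, and the timestep.
        t = timesteps[:, None, None].expand(-1, latent.shape[1], 1)
        noise_pred = self.diffusion_head(torch.cat([latent, noisy_targets, t], dim=-1))
        return rgb_pred, noise_pred


# Toy joint training step; the equal loss weighting is illustrative only.
model = HybridNVSModel()
B, N = 2, 64
patches = torch.randn(B, N, 3 * 16 * 16)
rays = torch.randn(B, N, 6)
target = torch.randn(B, N, 3 * 16 * 16)
t = torch.rand(B)
noise = torch.randn_like(target)
noisy_target = target + t[:, None, None] * noise           # simplistic forward noising process
rgb_pred, noise_pred = model(patches, rays, noisy_target, t)
loss = nn.functional.mse_loss(rgb_pred, target) + nn.functional.mse_loss(noise_pred, noise)
loss.backward()
```

In such a setup the regression head can be evaluated in a single feed-forward pass for observed regions, while the denoising head would be iterated only over masked or unseen tokens, which is consistent with the abstract's claim of reduced rendering cost relative to fully generative baselines.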
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 25829