Feature Warping-and-Conditioning for Representation-Guided Novel View Synthesis

ICLR 2026 Conference Submission 18204 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Novel View Synthesis (NVS), Generative Models
Abstract: We present a novel framework for diffusion-based novel-view synthesis that harnesses the rich semantic and geometric representations of VGGT, a transformer model trained for multi-view geometry prediction. Unlike existing methods that rely on explicit 3D models (e.g., NeRF) or monocular depth estimates for guidance, our approach reformulates view synthesis as a warping-and-inpainting task: first, VGGT features from multiple reference views are geometrically warped into the target pose; then, a diffusion U-Net generates the final image by attending both to the warped features (for accurate reconstruction of visible regions) and to semantically similar cues (for plausible inpainting of occluded areas). Through an empirical analysis of DINOv2, CroCo, and VGGT features, we demonstrate that VGGT's multiscale attention consistently delivers superior geometric correspondence and semantic coherence. Building on these insights, we design a multi-view synthesis architecture with dedicated warping-and-conditioning modules that inject VGGT features into the diffusion process. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.
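The abstract describes geometrically warping reference-view features into the target pose before diffusion conditioning. The paper's actual module is not given here, so the following is only a minimal PyTorch sketch of one standard way such a backward warp could work, assuming per-pixel target-view depth (e.g., from VGGT's geometry prediction), shared pinhole intrinsics `K`, and world-to-camera extrinsics `T_ref`/`T_tgt`; all names and shapes are illustrative.

```python
# Minimal sketch of backward feature warping (hypothetical, not the authors' code):
# for each target pixel, unproject with target depth, reproject into the
# reference view, and bilinearly sample the reference feature map.
import torch
import torch.nn.functional as F

def warp_features(ref_feats, tgt_depth, K, T_ref, T_tgt):
    """ref_feats: (B, C, H, W) reference-view features (e.g., VGGT tokens on a grid)
    tgt_depth:    (B, 1, H, W) per-pixel depth in the target view
    K:            (B, 3, 3) pinhole intrinsics shared by both views
    T_ref, T_tgt: (B, 4, 4) world-to-camera extrinsics
    Returns warped features (B, C, H, W) and an in-bounds validity mask."""
    B, C, H, W = ref_feats.shape
    device = ref_feats.device

    # Target-view pixel grid in homogeneous coordinates, shape (1, 3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)

    # Unproject target pixels to camera space, scaled by target depth.
    cam_pts = torch.linalg.inv(K) @ pix.expand(B, -1, -1)   # (B, 3, H*W)
    cam_pts = cam_pts * tgt_depth.reshape(B, 1, -1)

    # Relative transform: target camera -> reference camera.
    T_rel = T_ref @ torch.linalg.inv(T_tgt)
    R, t = T_rel[:, :3, :3], T_rel[:, :3, 3:]
    ref_pts = R @ cam_pts + t                               # (B, 3, H*W)

    # Project into the reference image plane.
    proj = K @ ref_pts
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)         # (B, 2, H*W)

    # Normalize to [-1, 1] and gather reference features via grid_sample.
    u = uv[:, 0] / (W - 1) * 2 - 1
    v = uv[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(ref_feats, grid, align_corners=True)

    # Mask pixels that land outside the reference frame or behind its camera;
    # masked-out regions are what the diffusion model must inpaint.
    valid = ((u.abs() <= 1) & (v.abs() <= 1) & (ref_pts[:, 2] > 0))
    valid = valid.reshape(B, 1, H, W)
    return warped * valid, valid
```

With multiple reference views, one would run this warp per view and let the U-Net attend across the stacked warped features; the validity mask indicates occluded or out-of-frame regions where only semantic cues are available.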
Primary Area: generative models
Submission Number: 18204