Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

Published: 26 Jan 2026, Last Modified: 11 Apr 2026. ICLR 2026 Poster. License: CC BY 4.0
Keywords: Novel view synthesis, diffusion model
TL;DR: Cross-modal Attention Instillation for Aligned Novel View Image and Geometry Synthesis
Abstract: We introduce a diffusion-based framework that generates aligned novel-view images and geometry via a warping-and-inpainting methodology. Unlike prior methods that require densely posed images, or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to estimate partial geometry as seen from the reference images, and formulates novel view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between the generated image and geometry, we propose cross-modal attention instillation, in which attention maps from an image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach yields synergistic effects, facilitating both geometrically robust image synthesis and accurate geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between sparse point clouds and preventing erroneously predicted geometry from influencing the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis, delivers competitive reconstruction in interpolation settings, and produces geometrically aligned point clouds for 3D completion.
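The core of attention instillation can be sketched as reusing the attention map computed in the image branch to aggregate values in the geometry branch. The following is a minimal numpy illustration of that idea only; the function name, toy shapes, and single-head formulation are illustrative assumptions, not the paper's actual diffusion implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_instillation(q_img, k_img, v_geo):
    """Compute the attention map in the image branch, then 'instill' it
    into the geometry branch by using it to aggregate geometry values."""
    d = q_img.shape[-1]
    attn = softmax(q_img @ k_img.T / np.sqrt(d))  # image-branch attention map
    return attn @ v_geo  # geometry values weighted by image attention

# toy example: 4 tokens with 8-dim features (hypothetical sizes)
rng = np.random.default_rng(0)
q_img, k_img = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
v_geo = rng.normal(size=(4, 8))
out = attention_instillation(q_img, k_img, v_geo)
print(out.shape)  # (4, 8)
```

Because the same attention map drives both branches, corresponding tokens in the image and geometry outputs attend to the same spatial context, which is what keeps the two modalities aligned.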
Primary Area: generative models
Submission Number: 5305