CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models

14 Sept 2025 (modified: 12 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | Readers: Everyone | License: CC BY 4.0
Keywords: NVS
Abstract: We propose a novel framework that improves both the training efficiency and the generation quality of multi-view diffusion models. While these models have emerged as a powerful paradigm for novel view synthesis (NVS) by leveraging their generative priors, they have no 3D inductive bias and therefore lack inherent geometry awareness. Moreover, they are typically trained with only a 2D denoising objective, which leaves the learning of geometric correspondence implicit and inefficient. In this work, we are the first to reveal that the 3D attention maps of these models exhibit emergent geometric correspondence: they attend to corresponding regions not only across reference views but also across target views. Furthermore, we observe that a model's generation quality correlates strongly with how well its attention maps align with geometric correspondence. Motivated by these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence signals. Notably, supervising a single attention layer is sufficient to guide the model toward learning accurate correspondences, resulting in faster convergence and improved novel view synthesis performance. Applied to CAT3D, a popular multi-view diffusion architecture, CAMEO accelerates convergence by 2.0$\times$ and improves novel view synthesis quality. Code and weights will be publicly released.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4943
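
The abstract describes supervising attention maps with geometric correspondence signals but does not state the form of the loss. As a rough illustration only, the sketch below assumes a cross-entropy term between each query token's attention distribution and a one-hot target at its corresponding key token. The function names, the one-hot target, and the handling of queries without a correspondence are all assumptions made for this example, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_alignment_loss(attn_logits, correspondences):
    """Mean cross-entropy between attention rows and one-hot targets.

    attn_logits: one row per query token, one logit per key token
        (hypothetical layout for a single attention head and layer).
    correspondences: for each query, the index of the key token that
        corresponds to it geometrically, or None when no correspondence
        exists (for example, an occluded point). This encoding is an
        assumption of the sketch.
    """
    losses = []
    for row, target in zip(attn_logits, correspondences):
        if target is None:
            continue  # skip queries that have no supervision signal
        probs = softmax(row)
        losses.append(-math.log(probs[target] + 1e-12))
    # Return 0.0 when no query is supervised, so the term vanishes.
    return sum(losses) / len(losses) if losses else 0.0


# Two queries over three keys; the second query has no correspondence.
logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
aligned = attention_alignment_loss(logits, [0, None])
misaligned = attention_alignment_loss(logits, [2, None])
```

In a training loop, a term like this would presumably be added to the 2D denoising objective with some weighting factor, applied to a single attention layer as the abstract suggests. The weighting and the choice of layer are not given in the source.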