CDG-MAE: Learning Correspondences from Diffusion Generated Views

ICLR 2026 Conference Submission 13280 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Self-supervised Learning, Diffusion, Masked Autoencoders
Abstract: Dense correspondences are critical for applications such as video label propagation, but learning them is difficult because manual annotation is tedious and does not scale. Self-supervised methods address this through a cross-view pretext task, often modeled with a masked autoencoder, in which a masked target view is reconstructed from an anchor view. However, acquiring effective training data remains a challenge: collecting diverse video datasets is costly, while simple image crops lack the pose variation needed to match video-based methods. This paper introduces CDG-MAE, a novel MAE-based self-supervised method that uses diverse synthetic views generated from static images by an image-conditioned diffusion model. We present a quantitative method for evaluating the local and global consistency of the generated views, which guides the choice of diffusion model for cross-view self-supervised pretraining. The generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video- and crop-based anchors. Furthermore, we extend the standard single-anchor MAE setting to a multi-anchor masking strategy that increases the difficulty of the pretext task. CDG-MAE substantially narrows the gap to video-based MAE methods while retaining the data advantages of image-only MAEs.
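To make the cross-view pretext task concrete, the sketch below shows one way a multi-anchor masked-autoencoder step could look: a target view is randomly masked, its visible tokens are encoded jointly with tokens from one or more unmasked anchor views, and a decoder reconstructs the masked patches in pixel space. All class names, sizes, and design choices here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a multi-anchor cross-view MAE step, assuming a ViT-style
# encoder and pixel reconstruction targets. Names, sizes, and structure are
# illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossViewMAE(nn.Module):
    """Reconstruct masked patches of a target view from its visible patches
    plus tokens of one or more unmasked anchor views (e.g. generated views)."""

    def __init__(self, dim=256, patch=16, img=224, depth=4, heads=8):
        super().__init__()
        self.patch = patch
        self.num_patches = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, patch, patch)                 # patchify + project
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.TransformerEncoder(layer, 2)
        self.head = nn.Linear(dim, patch * patch * 3)                # per-patch pixels

    def tokens(self, x):
        return self.embed(x).flatten(2).transpose(1, 2) + self.pos

    def forward(self, target, anchors, mask_ratio=0.75):
        # target: (B, 3, H, W); anchors: list of (B, 3, H, W) alternate views.
        t = self.tokens(target)
        B, N, D = t.shape
        keep = int(N * (1 - mask_ratio))
        ids = torch.rand(B, N, device=t.device).argsort(dim=1)       # random masking
        vis_ids, mask_ids = ids[:, :keep], ids[:, keep:]
        vis = t.gather(1, vis_ids[..., None].expand(-1, -1, D))

        # Encode visible target tokens jointly with all (unmasked) anchor tokens.
        anchor_tok = torch.cat([self.tokens(a) for a in anchors], dim=1)
        enc = self.encoder(torch.cat([vis, anchor_tok], dim=1))

        # Decode with mask tokens placed at the masked target positions.
        masked_pos = self.pos.expand(B, -1, -1).gather(1, mask_ids[..., None].expand(-1, -1, D))
        dec = self.decoder(torch.cat([enc, self.mask_token + masked_pos], dim=1))
        pred = self.head(dec[:, -(N - keep):])                       # (B, N-keep, p*p*3)

        # Pixel-space reconstruction loss on masked patches only.
        patches = F.unfold(target, self.patch, stride=self.patch).transpose(1, 2)
        gt = patches.gather(1, mask_ids[..., None].expand(-1, -1, patches.size(-1)))
        return F.mse_loss(pred, gt)


model = CrossViewMAE()
target = torch.randn(2, 3, 224, 224)
anchors = [torch.randn(2, 3, 224, 224) for _ in range(2)]            # stand-ins for generated views
loss = model(target, anchors)
loss.backward()
```

In this sketch, supplying several anchors at once gives the decoder more cross-view evidence to relate to each masked region, which is one way to read the abstract's claim that the multi-anchor setting makes the pretext task harder than the single-anchor case.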
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 13280