CDG-MAE: Cross-view Masked Modeling using Diffusion Generated Views

TMLR Paper8197 Authors

31 Mar 2026 (modified: 25 May 2026). Under review for TMLR. CC BY 4.0
Abstract: Cross-view masked autoencoding has emerged as a powerful pretext task for learning dense correspondences, which are essential for applications such as video label propagation. The cross-view pretext task is modeled with a masked autoencoder, where a masked target view is reconstructed from an anchor view. However, acquiring effective training data remains a challenge: collecting diverse video datasets is costly, while simple image crops lack the necessary pose variations and therefore underperform video-based methods. This paper introduces CDG-MAE, a novel MAE-based self-supervised method that uses diverse synthetic views generated from static images via an image-conditioned diffusion model. We present a quantitative method for evaluating the local and global consistency of the generated views, which we use to select a suitable diffusion model for cross-view self-supervised pretraining. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video-based and crop-based anchors. Furthermore, we extend the standard single-anchor MAE setting to a multi-anchor masking strategy to increase the difficulty of the pretext task. CDG-MAE substantially narrows the gap to video-based MAE methods, while maintaining the data advantages of image-only MAEs.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: All changes are highlighted in blue.
Assigned Action Editor: ~Lu_Jiang1
Submission Number: 8197