Keywords: Corruption, C2R, Self-Supervised Learning, Pre-training, Masked Image Modeling, Denoising Diffusion Model
TL;DR: We study how corruption should be used in SSL, focusing on C2R pretraining with masking and noise.
Abstract: We study how corruption design—masking and additive noise—affects self-supervised pretraining of vision models. Although denoising diffusion models succeed in generation, noise-driven extensions of masked image modeling (MIM) achieve only marginal gains on recognition tasks, including fine-grained benchmarks. We investigate why this is the case, seeking effective ways to combine masking and noising within the corruption-to-reconstruction (C2R) paradigm. We begin by analyzing prior noise-based MIM approaches, categorizing them into Substitutive Corruption (masked tokens replaced by noised ones) and Conjunctive Corruption (masked and noised tokens coexist), and further into Encoder- or Decoder-style depending on where corruption and restoration occur. Our analysis shows that the literature has trended toward Decoder-style designs; in contrast, we evaluate an Encoder-style alternative with a focus on transferability. Building on these analyses, we propose three principles for effective C2R pretraining: corruption and restoration should occur within the encoder, noise is most effective when injected at the feature level, and mask reconstruction and denoising must be explicitly disentangled to avoid interference. Implementing these findings, we develop a framework that captures a broader frequency spectrum of representations and improves transferability, surpassing MIM by up to 8.1% and recent noise-driven pretraining methods by 8.0% across diverse recognition benchmarks. Code is available in the Supplementary Material.
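The three principles suggest a concrete corruption pipeline. Below is a minimal sketch of how an Encoder-style C2R setup following those principles might look in PyTorch: corruption and restoration both live at the encoder, noise is added to token features rather than pixels, and two separate heads handle mask reconstruction and denoising. All module and variable names (e.g. `C2REncoderSketch`, `mask_head`, `denoise_head`) are hypothetical illustrations under stated assumptions, not the authors' implementation.

```python
# Minimal sketch of Encoder-style C2R corruption (hypothetical, not the paper's code).
import torch
import torch.nn as nn

class C2REncoderSketch(nn.Module):
    def __init__(self, dim=768, depth=12, num_heads=12,
                 mask_ratio=0.75, noise_std=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mask_ratio = mask_ratio
        self.noise_std = noise_std
        # Principle 3: disentangled heads, one restores masked tokens,
        # the other removes noise, so the two objectives do not interfere.
        self.mask_head = nn.Linear(dim, dim)
        self.denoise_head = nn.Linear(dim, dim)

    def forward(self, tokens):  # tokens: (B, N, D) patch embeddings
        B, N, D = tokens.shape
        # Principle 1: corruption is applied to the encoder input itself,
        # not deferred to a separate decoder.
        mask = torch.rand(B, N, device=tokens.device) < self.mask_ratio
        corrupted = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(B, N, D), tokens)
        # Principle 2: noise is injected at the feature level,
        # here on the visible (unmasked) token embeddings.
        noise = self.noise_std * torch.randn_like(corrupted)
        corrupted = corrupted + noise * (~mask).unsqueeze(-1)
        feats = self.blocks(corrupted)
        recon = self.mask_head(feats)       # supervised on masked positions
        denoised = self.denoise_head(feats) # supervised on noised positions
        loss = (recon - tokens.detach()).pow(2)[mask].mean() \
             + (denoised - tokens.detach()).pow(2)[~mask].mean()
        return loss
```

Note the Conjunctive flavor of this sketch: masked and noised tokens coexist in one corrupted sequence, and each corruption type is paired with its own restoration target.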
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16244