Abstract: We introduce CAN, a simple, efficient and scalable method for self-supervised
learning of visual representations. Our framework is a minimal and conceptually
clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the
noise prediction approach used in diffusion models. The learning mechanisms are
complementary to one another: contrastive learning shapes the embedding space
across a batch of image samples; masked autoencoders focus on reconstruction of
the low-frequency spatial correlations in a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image.
The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with 50% of patches in both views
being masked at random, yielding a considerable efficiency improvement over
prior contrastive learning methods. Extensive empirical studies demonstrate that
CAN achieves strong downstream performance under both linear and finetuning
evaluations on transfer learning and robustness tasks. CAN outperforms MAE and
SimCLR when pre-training on ImageNet, but is especially useful for pre-training
on larger uncurated datasets such as JFT-300M: for a linear probe on ImageNet,
CAN achieves 75.4% compared to 73.4% for SimCLR and 64.1% for MAE. The
finetuned performance on ImageNet of our ViT-L model is 86.1%, compared to
85.5% for SimCLR, and 85.4% for MAE. The overall FLOPs load of SimCLR is
70% higher than that of CAN for ViT-L models.
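To make the combined objective concrete, the following is a minimal, illustrative sketch of a CAN-style training step under simplifying assumptions: two augmented views, 50% of patch tokens masked per view, an MSE reconstruction loss on the masked patches, an MSE loss on the predicted noise, and an InfoNCE contrastive loss across the batch. The module names (encoder, decoder, noise_head, proj), shapes, noise scale, and equal loss weighting are illustrative placeholders, not the implementation used in the paper.

```python
# Hedged sketch of the (C)ontrastive + (A)utoencoding + (N)oise-prediction objective.
# All modules and hyperparameters below are toy assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def split_patches(patches, mask_ratio=0.5):
    """Randomly split patch tokens into visible and masked subsets (per image)."""
    B, N, D = patches.shape
    n_vis = int(N * (1 - mask_ratio))
    perm = torch.rand(B, N).argsort(dim=1)          # random permutation per image
    vis_idx, mask_idx = perm[:, :n_vis], perm[:, n_vis:]
    def gather(idx):
        return torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return gather(vis_idx), gather(mask_idx)

def info_nce(z1, z2, temperature=0.1):
    """(C) Contrastive loss across the batch: matching views are positives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def view_losses(encoder, decoder, noise_head, visible, masked, sigma=0.25):
    """Reconstruction and noise-prediction losses for one masked, noised view."""
    noise = sigma * torch.randn_like(visible)       # add Gaussian noise to visible tokens
    h = encoder(visible + noise)                    # encode only the (noisy) visible tokens
    # (A) predict the masked patches from the encoded visible tokens
    # (with 50% masking, visible and masked counts match, so shapes line up)
    recon = decoder(h)
    # (N) predict the noise that was added to the visible tokens
    noise_pred = noise_head(h)
    return F.mse_loss(recon, masked) + F.mse_loss(noise_pred, noise), h

# Toy stand-ins for a ViT encoder/decoder, just to make the sketch runnable.
D = 64
encoder    = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
decoder    = nn.Linear(D, D)
noise_head = nn.Linear(D, D)
proj       = nn.Linear(D, D)

view1 = torch.randn(8, 16, D)   # batch of 8 images, 16 patch tokens each (view 1)
view2 = torch.randn(8, 16, D)   # second augmented view of the same images

loss, feats = torch.tensor(0.0), []
for view in (view1, view2):     # symmetric: both views are masked and noised
    visible, masked = split_patches(view)
    l, h = view_losses(encoder, decoder, noise_head, visible, masked)
    loss = loss + l
    feats.append(proj(h.mean(dim=1)))               # pooled embedding for contrast
loss = loss + info_nce(feats[0], feats[1])          # combine all three objectives
loss.backward()
```

Because the encoder only ever sees the 50% of patch tokens that survive masking, each view costs roughly half the encoder FLOPs of a full-resolution contrastive forward pass, which is the source of the efficiency gain noted above.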