CAN: A simple, efficient and scalable contrastive masked autoencoder framework for learning visual representations

Published: 01 Feb 2023, Last Modified: 13 Feb 2023, Submitted to ICLR 2023
Keywords: Self-supervised learning, contrastive learning, masked autoencoders
TL;DR: We propose a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) noise prediction for self-supervised learning on images
Abstract: We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are \emph{complementary} to one another: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstruction of the low-frequency spatial correlations in a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image. The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with $50\%$ of patches in \emph{both views} being masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. Extensive empirical studies on linear evaluation, finetuning, transfer learning, and robustness demonstrate that our approach achieves strong downstream performance. For instance, when pre-training ViT-B encoders on the curated ImageNet dataset, CAN achieves $74.8\%$ top-1 linear probing accuracy, an absolute improvement of $6.8\%$ over MAE and $1.3\%$ over SimCLR with the same architecture and data augmentations. CAN is especially useful for pre-training on larger uncurated datasets such as JFT-300M: the finetuned performance on ImageNet of our ViT-L model is $85.9\%$, compared to $85.0\%$ for SimCLR, and $85.4\%$ for MAE. For linear probing on ImageNet, CAN achieves $75.4\%$ compared to $71.8\%$ for SimCLR and $64.1\%$ for MAE. The overall FLOPs load is $41\%$ \emph{lower} than that of SimCLR\footnote{Our code will be released at \url{www.xxx.yyy}.}.
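
As a rough illustration of how the three objectives described in the abstract could fit together, the sketch below combines symmetric 50% masking of both views, a patch reconstruction loss, a noise prediction loss, and an InfoNCE-style contrastive term. This is a minimal sketch, not the authors' implementation: the `encoder`, `decoder`, and `proj_head` interfaces, the noise level, the temperature, and the equal loss weighting are all assumptions made for illustration.

```python
# Minimal sketch of a CAN-style combined objective (assumptions noted in comments).
import torch
import torch.nn.functional as F

def can_losses(encoder, decoder, proj_head, view1, view2,
               mask_ratio=0.5, noise_std=0.1, temperature=0.1):
    """view1, view2: two augmented views of the same image batch,
    already patchified to shape (B, N, D). encoder/decoder/proj_head
    are hypothetical callables, not the paper's exact modules."""
    recon_losses, embeddings = [], []
    for view in (view1, view2):
        B, N, D = view.shape
        # Add Gaussian noise to the patches (denoising-style objective).
        noise = noise_std * torch.randn_like(view)
        noisy = view + noise

        # Symmetric masking: keep a random 50% of patches in each view.
        keep = int(N * (1 - mask_ratio))
        perm = torch.rand(B, N, device=view.device).argsort(dim=1)
        keep_idx = perm[:, :keep]
        visible = torch.gather(noisy, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

        # Encode visible patches only; decode to predictions for all positions.
        latent = encoder(visible)                          # (B, keep, D_enc)
        recon, noise_pred = decoder(latent, keep_idx, N)   # both (B, N, D)

        # (A) masked reconstruction and (N) noise prediction losses.
        # For simplicity both are averaged over all positions here; the paper's
        # exact restriction to masked/visible patches may differ.
        recon_losses.append(F.mse_loss(recon, view) + F.mse_loss(noise_pred, noise))
        embeddings.append(F.normalize(proj_head(latent.mean(dim=1)), dim=-1))

    # (C) symmetric InfoNCE contrastive loss between pooled view embeddings.
    z1, z2 = embeddings
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))
    return contrastive + sum(recon_losses)  # equal weighting assumed
```

Because both views are masked, the encoder only processes half of the patch tokens per view, which is the source of the efficiency gain over full-view contrastive pipelines claimed in the abstract.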
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning
Supplementary Material: zip