Abstract: Hybrid self-supervised learning methods that combine masked image modelling and contrastive learning have demonstrated state-of-the-art performance across many vision tasks. In this work we identify a property overlooked by previous hybrid methods: they can achieve considerable efficiency improvements compared to contrastive learning, whilst still outperforming their constituent contrastive and masked image modelling training components. To demonstrate this, we introduce CAN, a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. CAN is designed to be efficient: it masks 50\% of patches in \emph{both} views, so that SimCLR requires 70\% more FLOPs overall than CAN for ViT-L backbones. Our combined approach outperforms its MAE and SimCLR constituent parts on an extensive set of downstream transfer learning and robustness tasks under both linear probe and finetuning protocols, including when pre-training on large-scale datasets such as JFT-300M and ImageNet-21K. Code is provided in the supplementary material and will be publicly released.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Fixed broken references to tables in the appendix.
Assigned Action Editor: ~Yale_Song1
Submission Number: 1253