Masked Autoencoding Does Not Help Natural Language Supervision at Scale

Floris Weers, Vaishaal Shankar, Angelos Katharopoulos, Yinfei Yang, Tom Gunter

Published: 01 Jan 2023, Last Modified: 29 Sept 2024, CVPR 2023, CC BY-SA 4.0

Abstract: Self-supervision and natural language supervision have emerged as two exciting ways to train general-purpose image encoders that excel at a variety of downstream tasks. Recent works such as M3AE [31] and SLIP [63] have suggested that these approaches can be effectively combined, but, notably, their results use small (<20M examples) pre-training datasets and do not reflect the large-scale regime (>100M samples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked autoencoders (MAE [37]) and contrastive language-image pre-training (CLIP [68]), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much-needed clarity into the effectiveness (or lack thereof) of self-supervision for large-scale image-text training.