Keywords: Self-supervised Learning, Group Masked Model Learning, Masked Autoencoders, Vision Transformers.
Abstract: Vision transformers have generated significant interest in the computer vision (CV) community because of their flexibility in exploiting contextual information, whether it is sharply confined and local or long-range and global. However, they are known to be data hungry. This has motivated research into self-supervised transformer pretraining, which does not need to decode the semantic information conveyed by labels and link it to image properties, but instead focuses directly on extracting a concise representation of the image data that reflects the notion of similarity and is invariant to nuisance factors. The key vehicle for the self-learning process used by the majority of self-supervised methods is the generation of multiple views of the training data and the creation of pretext tasks which use these views to define the notion of image similarity and data integrity. However, this approach lacks a natural propensity to extract contextual information. We propose group masked model learning (GMML), a self-supervised learning (SSL) mechanism for pretraining vision transformers with the ability to extract the contextual information present in all the concepts in an image. GMML achieves this by manipulating random groups of connected tokens, thereby covering a meaningful part of a semantic concept, and then recovering the hidden semantic information from the visible part of the concept. GMML implicitly introduces a novel data augmentation process. Unlike most existing SSL approaches, GMML does not require a momentum encoder, nor does it rely on careful implementation details such as large batches and gradient stopping, which are artefacts of many current self-supervised learning techniques. Since its conception at the beginning of 2021, GMML has remained an unbeaten SSL method with several desirable benefits, and it marked a significant milestone in computer vision by being one of the first self-supervised pretraining methods to outperform supervised pretraining consistently and by a large margin. GMML is simple, elegant, and currently the best mechanism to extract information from a given dataset and instil it into a transformer's weights. The code will be made publicly available for the community to train on bigger corpora.
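The abstract describes GMML as corrupting random groups of connected tokens and recovering the hidden content from the visible context. Below is a minimal, illustrative PyTorch sketch of that idea, not the authors' released code: the block-sampling heuristic, the function names (`group_mask`, `gmml_reconstruction_loss`), the noise-based corruption, and the L1 reconstruction objective are assumptions made for illustration.

```python
# Illustrative sketch of GMML-style group masking (assumed details, not the official implementation).
import torch


def group_mask(grid_h: int, grid_w: int, mask_ratio: float = 0.5,
               max_block: int = 4) -> torch.Tensor:
    """Return a boolean (grid_h, grid_w) mask where True marks masked patches.

    Masked patches are sampled as connected rectangular groups, so that a
    meaningful portion of a semantic concept is hidden rather than isolated
    patches scattered over the image.
    """
    mask = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    target = int(mask_ratio * grid_h * grid_w)
    while mask.sum() < target:
        bh = torch.randint(1, max_block + 1, (1,)).item()
        bw = torch.randint(1, max_block + 1, (1,)).item()
        top = torch.randint(0, grid_h - bh + 1, (1,)).item()
        left = torch.randint(0, grid_w - bw + 1, (1,)).item()
        mask[top:top + bh, left:left + bw] = True
    return mask


def gmml_reconstruction_loss(encoder, decoder, images: torch.Tensor,
                             patch: int = 16, mask_ratio: float = 0.5) -> torch.Tensor:
    """Corrupt grouped patches with noise, reconstruct with a (hypothetical)
    encoder/decoder pair, and penalise only the corrupted regions (L1)."""
    b, c, h, w = images.shape
    mask = group_mask(h // patch, w // patch, mask_ratio)            # (gh, gw)
    pixel_mask = mask.repeat_interleave(patch, 0).repeat_interleave(patch, 1)
    pixel_mask = pixel_mask.to(images.device)[None, None]            # (1, 1, H, W)

    corrupted = torch.where(pixel_mask, torch.randn_like(images), images)
    recon = decoder(encoder(corrupted))                              # same shape as images
    return (pixel_mask * (recon - images).abs()).mean()
```

In this reading, the group structure of the mask is what forces the network to infer a hidden part of a concept from its visible remainder, in contrast to per-patch random masking.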
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/gmml-is-all-you-need/code)