Keywords: self-supervised learning, masked autoencoder, efficient training
Abstract: Masked Autoencoders (MAE), introduced by He et al. (2022), provide a strong framework for pre-training Vision Transformers (ViTs). In this paper, we accelerate MAE training by 59× or more with little performance drop. Our changes are simple and straightforward: in the pre-training stage, we aggressively increase the masking ratio, decrease the number of training epochs, and reduce the decoder depth to lower the pre-training cost; in the fine-tuning stage, we reveal that layer-wise learning rate decay plays a vital role in unleashing the power of pre-trained models. With this setup, we are able to pre-train a ViT-B in 12.6 hours on a single NVIDIA A100 GPU while still attaining a competitive 83.0% top-1 accuracy on the downstream ImageNet classification task. We additionally verify the speed-up on another MAE extension, SupMAE.
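Since the abstract highlights layer-wise learning rate decay as the key fine-tuning ingredient, the following is a minimal sketch of how such parameter groups can be built for a ViT-B. It assumes a timm-style model with `patch_embed`, `blocks`, and `head` attributes; all names and default values here are illustrative assumptions, not the authors' released code.

```python
import torch

def param_groups_llrd(model, base_lr=1e-3, weight_decay=0.05, layer_decay=0.75):
    """Assign each transformer block a learning rate scaled by
    layer_decay ** (num_layers - layer_id), so earlier layers train slower."""
    num_layers = len(model.blocks) + 1  # +1 treats the patch embedding as layer 0
    scales = [layer_decay ** (num_layers - i) for i in range(num_layers + 1)]

    def layer_id(name):
        # Embedding-related parameters belong to the earliest "layer".
        if name.startswith(("cls_token", "pos_embed", "patch_embed")):
            return 0
        # Transformer blocks are indexed as blocks.<i>.<...>.
        if name.startswith("blocks."):
            return int(name.split(".")[1]) + 1
        # Head and any remaining parameters use the full base learning rate.
        return num_layers

    groups = {}
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        lid = layer_id(name)
        key = f"layer_{lid}"
        if key not in groups:
            groups[key] = {"params": [], "lr": base_lr * scales[lid],
                           "weight_decay": weight_decay}
        groups[key]["params"].append(p)
    return list(groups.values())

# Hypothetical usage with a ViT-B instance `vit_b`:
# optimizer = torch.optim.AdamW(param_groups_llrd(vit_b))
```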
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
TL;DR: we significantly accelerate MAE training by 59× or more