Masked Autoencoders Enable Efficient Knowledge Distillers

22 Sept 2022 (modified: 12 Mar 2024) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Transformer, Pretraining, Knowledge Distillation
TL;DR: This paper studies the potential of distilling knowledge from self-supervised pre-trained models, especially Masked Autoencoders
Abstract: This paper studies the potential of distilling knowledge from self-supervised pre-trained models, especially Masked Autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature map of the teacher model and that of the student model. This design yields a computationally efficient knowledge distillation framework, given that 1) only a small visible subset of patches is used, and 2) the teacher model only needs to forward-propagate inputs through its first few layers to obtain intermediate feature maps. Compared to directly distilling fine-tuned models, distilling pre-trained models substantially improves downstream representation learning while incurring little extra pre-training cost. For example, by distilling the knowledge from an MAE pre-trained ViT-L into a ViT-B, our method achieves 84.0% ImageNet top-1 accuracy, outperforming the baseline of distilling a fine-tuned ViT-L by 1.2%, with no extra training time at all. More interestingly, our method robustly handles different masking ratios: e.g., even at an extreme 95% masking ratio, where merely TEN patches are visible during distillation, our ViT-B still secures a top-1 accuracy of 83.8%, while further reducing total training time by 13% relative to the distilling-during-fine-tuning baseline.
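To make the described objective concrete, below is a minimal sketch of a combined loss: the MAE-style pixel reconstruction loss on masked patches plus a distance between intermediate feature maps of teacher and student computed on the small visible subset of patches. This is only an illustration of the idea as stated in the abstract, not the authors' implementation; the function name `distillation_loss`, the weighting factor `alpha`, the tensor shapes, and the choice of smooth L1 as the feature distance are all assumptions.

```python
# Sketch (assumed, not the paper's code): pixel reconstruction on masked
# patches + feature distillation on visible patches, in PyTorch.
import torch
import torch.nn.functional as F

def distillation_loss(pred_pixels, target_pixels, mask,
                      student_feat, teacher_feat, alpha=1.0):
    """pred_pixels / target_pixels: (B, N, P) per-patch pixel values.
    mask: (B, N), 1 for masked patches, 0 for visible ones.
    student_feat / teacher_feat: (B, N_vis, D) intermediate features,
    computed only on the small visible subset of patches (the teacher
    forwards them through its first few layers only)."""
    # Pixel reconstruction loss, averaged over masked patches (as in MAE).
    recon = ((pred_pixels - target_pixels) ** 2).mean(dim=-1)      # (B, N)
    recon_loss = (recon * mask).sum() / mask.sum().clamp(min=1)

    # Feature distillation: distance between teacher and student
    # intermediate feature maps; the metric here is an assumption.
    distill_loss = F.smooth_l1_loss(student_feat, teacher_feat.detach())

    return recon_loss + alpha * distill_loss

# Toy usage with random tensors: 196 patches, 10 visible
# (roughly a 95% masking ratio), feature dim 768.
B, N, P, N_vis, D = 2, 196, 16 * 16 * 3, 10, 768
loss = distillation_loss(
    torch.randn(B, N, P), torch.randn(B, N, P),
    (torch.rand(B, N) > 0.05).float(),
    torch.randn(B, N_vis, D), torch.randn(B, N_vis, D))
```

Because the teacher only processes the visible patches through its first few layers, the feature term adds little compute on top of standard MAE pre-training, which is the efficiency argument made in the abstract.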
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning
Community Implementations: 2 code implementations (https://www.catalyzex.com/paper/arxiv:2208.12256/code)