Masked Autoencoders Enable Efficient Knowledge Distillers

22 Sept 2022 (modified: 12 Mar 2024) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Transformer, Pretraining, Knowledge Distillation
TL;DR: This paper studies the potential of distilling knowledge from self-supervised pre-trained models, especially Masked Autoencoders
Abstract: This paper studies the potential of distilling knowledge from self-supervised pre-trained models, especially Masked Autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature map of the teacher model and that of the student model. This design yields a computationally efficient knowledge distillation framework, given that 1) only a small visible subset of patches is used, and 2) the teacher model only needs to forward-propagate inputs through its first few layers to obtain intermediate feature maps. Compared to directly distilling fine-tuned models, distilling pre-trained models substantially improves downstream representation learning while incurring little extra pre-training cost. For example, by distilling the knowledge from an MAE pre-trained ViT-L into a ViT-B, our method achieves 84.0% ImageNet top-1 accuracy, outperforming the baseline of distilling a fine-tuned ViT-L by 1.2%, with no extra training time at all. More interestingly, our method robustly handles different masking ratios: e.g., even at an extreme 95% masking ratio, where merely TEN patches are visible during distillation, our ViT-B still secures a top-1 accuracy of 83.8%, while further reducing total training time by 13% relative to the distilling-during-fine-tuning baseline.
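To make the described objective concrete, below is a minimal sketch of a combined loss: the MAE-style pixel reconstruction loss on masked patches plus a distance between intermediate feature maps of teacher and student computed on the small visible subset of patches. This is only an illustration of the idea as stated in the abstract, not the authors' implementation; the function name `distillation_loss`, the weighting factor `alpha`, the tensor shapes, and the choice of smooth L1 as the feature distance are all assumptions.

```python
# Sketch (assumed, not the paper's code): pixel reconstruction on masked
# patches + feature distillation on visible patches, in PyTorch.
import torch
import torch.nn.functional as F

def distillation_loss(pred_pixels, target_pixels, mask,
                      student_feat, teacher_feat, alpha=1.0):
    """pred_pixels / target_pixels: (B, N, P) per-patch pixel values.
    mask: (B, N), 1 for masked patches, 0 for visible ones.
    student_feat / teacher_feat: (B, N_vis, D) intermediate features,
    computed only on the small visible subset of patches (the teacher
    forwards them through its first few layers only)."""
    # Pixel reconstruction loss, averaged over masked patches (as in MAE).
    recon = ((pred_pixels - target_pixels) ** 2).mean(dim=-1)      # (B, N)
    recon_loss = (recon * mask).sum() / mask.sum().clamp(min=1)

    # Feature distillation: distance between teacher and student
    # intermediate feature maps; the metric here is an assumption.
    distill_loss = F.smooth_l1_loss(student_feat, teacher_feat.detach())

    return recon_loss + alpha * distill_loss

# Toy usage with random tensors: 196 patches, 10 visible
# (roughly a 95% masking ratio), feature dim 768.
B, N, P, N_vis, D = 2, 196, 16 * 16 * 3, 10, 768
loss = distillation_loss(
    torch.randn(B, N, P), torch.randn(B, N, P),
    (torch.rand(B, N) > 0.05).float(),
    torch.randn(B, N_vis, D), torch.randn(B, N_vis, D))
```

Because the teacher only processes the visible patches through its first few layers, the feature term adds little compute on top of standard MAE pre-training, which is the efficiency argument made in the abstract.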
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning
Community Implementations: 2 code implementations (https://www.catalyzex.com/paper/arxiv:2208.12256/code)