Keywords: Self-supervised learning, Empirical study
Abstract: The combination of transformers and masked image modeling (MIM) pre-training framework has shown remarkable potential in various vision tasks. However, the high computational cost of pre-training hinders the practical application of MIM.
This paper introduces \emph{FastMIM}, a simple and versatile framework that expedites masked image modeling through two steps: (i) pre-training vision backbones with low-resolution input images and (ii) reconstructing Histograms of Oriented Gradients (HOG) features instead of the original RGB values of the input images.
Furthermore, we propose \emph{FastMIM-P}, which progressively increases the input resolution during the pre-training stage to improve the transfer learning performance of high-capacity models. We point out that: (i) a wide range of input resolutions during pre-training can result in similar performance in fine-tuning and downstream tasks such as detection and segmentation; (ii) the shallow layers of the encoder are more important during pre-training, and discarding the last few layers can speed up training without affecting fine-tuning performance; and (iii) HOG features are more stable than RGB values when transferring across input resolutions. Equipped with \emph{FastMIM}, any type of vision backbone can be pre-trained efficiently. For example, using ViT-B/Swin-B as backbones, we achieve 83.8\%/84.1\% top-1 accuracy on ImageNet-1K. Compared to previous approaches, our method achieves better top-1 accuracy while accelerating the training procedure by 5×.
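To make the HOG reconstruction target concrete, the following is a minimal NumPy sketch of computing per-cell orientation histograms from a grayscale image; the function name, cell size, and bin count here are illustrative assumptions (this simplified version omits the block normalization used in full HOG pipelines, and the paper's actual implementation may differ):

```python
import numpy as np

def hog_target(img, cell=8, bins=9):
    """Per-cell Histograms of Oriented Gradients as a reconstruction
    target (illustrative sketch; no block normalization).
    img: (H, W) grayscale array with H and W divisible by `cell`."""
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)                       # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    H, W = img.shape
    hc, wc = H // cell, W // cell
    target = np.zeros((hc, wc, bins))
    for i in range(hc):
        for j in range(wc):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            b = bin_idx[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            # magnitude-weighted orientation histogram for this cell
            target[i, j] = np.bincount(b.ravel(), weights=m.ravel(),
                                       minlength=bins)
    return target  # shape (H/cell, W/cell, bins)

# During pre-training, the decoder would regress these per-cell vectors
# for the masked patches instead of raw RGB pixel values.
feat = hog_target(np.random.rand(32, 32))
```

Regressing such histograms rather than pixels makes the target invariant to local photometric changes, which is consistent with the observation above that HOG is the more stable target when the pre-training resolution changes.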
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 774