A Unified View of Masked Image Modeling
Abstract: Masked image modeling has demonstrated great potential for alleviating the label hunger of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under this unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioning on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. Using a huge vision Transformer pretrained for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224×224 input size) and 58.8 mIoU on ADE20k semantic segmentation (512×512 input size). Code is enclosed in the supplementary materials.
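The masked-distillation objective described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the per-token normalization, and the Smooth-L1 regression loss are all illustrative assumptions; the paper's exact teacher network and loss may differ.

```python
import numpy as np

def masked_distill_loss(student_pred, teacher_feat, mask, eps=1e-6):
    """Hypothetical sketch of a masked feature-distillation loss.

    student_pred: (N, D) student predictions for all N patch tokens,
                  produced from the corrupted (masked) input image.
    teacher_feat: (N, D) teacher features for the clean image.
    mask:         (N,) boolean array, True where the patch was masked.
    """
    # Normalize teacher features per token (zero mean, unit variance)
    # before using them as regression targets.
    mu = teacher_feat.mean(axis=-1, keepdims=True)
    sigma = teacher_feat.std(axis=-1, keepdims=True)
    target = (teacher_feat - mu) / (sigma + eps)
    # Regress the normalized targets at masked positions only,
    # here with a Smooth-L1 (Huber) penalty as one plausible choice.
    diff = student_pred[mask] - target[mask]
    absd = np.abs(diff)
    huber = np.where(absd < 1.0, 0.5 * diff ** 2, absd - 0.5)
    return huber.mean()
```

The key structural points the sketch captures are that the loss is computed only at masked positions and that the teacher features are normalized before regression.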
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Based on the suggestions from the AE, we improved our manuscript by evaluating an additional downstream task (few-shot learning) and comparing MaskDistill with other popular methods on it. Our observations are consistent with the findings presented in the paper "How Well Do Self-Supervised Models Transfer?". The results are presented in Table 6 of the camera-ready version, and the corresponding analysis is presented in Section 4.5.
Assigned Action Editor: ~Joao_Carreira1
Submission Number: 607