MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning

22 Sept 2022 (modified: 25 Nov 2024) | ICLR 2023 Conference Withdrawn Submission | Readers: Everyone
Keywords: self-supervised learning, masked image modeling
Abstract: In this study, we propose Mixed and Masked Image Modeling (MixMIM), a simple but efficient MIM method that is applicable to various hierarchical Vision Transformers. Existing MIM methods replace a random subset of input tokens with a special $\mathrm{[MASK]}$ symbol and aim to reconstruct the original image tokens from the corrupted image. However, we find that using the $\mathrm{[MASK]}$ symbol greatly slows down training and causes a training-finetuning inconsistency, due to the large masking ratio (e.g., 60% in SimMIM). In contrast, we replace the masked tokens of one image with the visible tokens of another image, i.e., we create a mixed image. We then conduct dual reconstruction to recover the two original images from the mixed input, which significantly improves efficiency. While MixMIM can be applied to various architectures, this paper explores a simpler but stronger hierarchical Transformer and scales it to MixMIM-B, -L, and -H. Empirical results demonstrate that MixMIM can learn high-quality visual representations efficiently. Notably, MixMIM-B with 88M parameters achieves 85.1% top-1 accuracy on ImageNet-1K after pretraining for 600 epochs. In addition, its transfer performance on 6 other datasets shows that MixMIM has a better FLOPs/performance tradeoff than previous MIM methods.
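
The mixing and dual-reconstruction idea described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `mix_tokens` and `dual_reconstruction_loss`, the default mask ratio of 0.5, and the use of mean squared error are all assumptions made for illustration, and tokens are represented as plain lists of floats rather than Transformer embeddings.

```python
import random


def mix_tokens(tokens_a, tokens_b, mask_ratio=0.5, seed=0):
    """Build a mixed token sequence from two images (illustrative only).

    mask[i] == 1: image A's token at position i is masked and replaced
                  by image B's token at the same position.
    mask[i] == 0: image A's token is kept (so B's token is the hidden one).
    The mask_ratio default is an assumption, not a value from the abstract.
    """
    assert len(tokens_a) == len(tokens_b)
    n = len(tokens_a)
    rng = random.Random(seed)
    masked = set(rng.sample(range(n), int(n * mask_ratio)))
    mask = [1 if i in masked else 0 for i in range(n)]
    mixed = [tokens_b[i] if m else tokens_a[i] for i, m in enumerate(mask)]
    return mixed, mask


def _mse(pred, target):
    # Mean squared error between two token vectors (assumed loss).
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def _mean(values):
    # Mean that returns 0.0 for an empty list instead of dividing by zero.
    return sum(values) / len(values) if values else 0.0


def dual_reconstruction_loss(pred_a, pred_b, tokens_a, tokens_b, mask):
    """Score each image only on the tokens missing from the mixed input.

    Image A is scored where mask == 1, image B where mask == 0.
    """
    loss_a = _mean([_mse(pred_a[i], tokens_a[i])
                    for i, m in enumerate(mask) if m == 1])
    loss_b = _mean([_mse(pred_b[i], tokens_b[i])
                    for i, m in enumerate(mask) if m == 0])
    return loss_a + loss_b


if __name__ == "__main__":
    # Toy "images": 8 one-dimensional tokens each.
    a = [[float(i)] for i in range(8)]
    b = [[float(-i - 1)] for i in range(8)]
    mixed, mask = mix_tokens(a, b)
    print("mask: ", mask)
    print("mixed:", mixed)
    # A perfect reconstruction of both images gives zero loss.
    print("loss: ", dual_reconstruction_loss(a, b, a, b, mask))
```

Every position of the mixed sequence holds a real image token, so no $\mathrm{[MASK]}$ symbol appears in the encoder input, and each position contributes a reconstruction target for exactly one of the two images.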
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning
Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/mixmim-mixed-and-masked-image-modeling-for/code)