Mimic before Reconstruct: Enhance Masked Autoencoders with Feature Mimicking

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Masked Autoencoders, Masked Convolution, Off-the-shelf pre-trained models DINO and CLIP
Abstract: Masked Autoencoders (MAE) have become a popular paradigm for large-scale vision representation pre-training. However, MAE reconstructs only low-level RGB signals after the decoder and provides no high-level semantic supervision for the encoder, which leads to sub-optimal learned representations and long pre-training schedules. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with features encoded by pre-trained image-image (DINO) or image-language (CLIP) contrastive models. Different from those efforts, we propose to Mimic before Reconstruct for Masked Autoencoders, termed MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE employs a mimic loss over the 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss of MAE to predict RGB pixel values for the 75% masked tokens after the decoder. Since MR-MAE applies the high-level and low-level targets to disjoint token partitions, the learning conflict between them is naturally avoided, contributing to superior visual representations for various downstream tasks. On ImageNet-1K, the MR-MAE base pre-trained for only 200 epochs achieves 85.0% top-1 accuracy after fine-tuning, surpassing the MAE base pre-trained for 1600 epochs by +1.4%. Furthermore, by appending masked convolution stages, MR-MCMAE reaches 85.8%, surpassing the previous state-of-the-art BEiT V2 base by +0.3% with far fewer computational resources (25% vs. 100% of tokens fed into the encoder, and 400 vs. 1600 pre-training epochs). Code and pre-trained models will be released.
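To make the two-branch objective described in the abstract concrete, below is a minimal sketch in a PyTorch style. It is not the authors' released code: the names `mr_mae_loss`, `patchify`, and `random_split`, the interfaces of `encoder`, `decoder`, and `teacher`, and the assumption that encoder and teacher features share the same dimension are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def patchify(images, patch=16):
    """Flatten (B, 3, H, W) images into (B, N, 3*patch*patch) patch tokens."""
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

def random_split(num_tokens, mask_ratio=0.75, device="cpu"):
    """Randomly split token indices into visible (25%) and masked (75%) sets."""
    perm = torch.randperm(num_tokens, device=device)
    n_vis = int(num_tokens * (1 - mask_ratio))
    return perm[:n_vis], perm[n_vis:]

def mr_mae_loss(images, encoder, decoder, teacher, mask_ratio=0.75):
    """Sketch of the joint objective: mimic loss on visible tokens,
    pixel reconstruction on masked tokens (hypothetical interfaces)."""
    patches = patchify(images)                                   # (B, N, patch_dim)
    vis_idx, msk_idx = random_split(patches.shape[1], mask_ratio, images.device)

    # Only the 25% visible tokens are fed into the encoder, as in MAE.
    latent = encoder(patches[:, vis_idx])                        # (B, N_vis, D)

    # High-level target: frozen CLIP/DINO patch features at the same visible
    # positions (feature dimensions assumed to match; a projection head would
    # otherwise be needed).
    with torch.no_grad():
        target = teacher(images)[:, vis_idx]                     # (B, N_vis, D)
    mimic = F.mse_loss(latent, target)

    # Low-level target: RGB pixels of the 75% masked tokens, predicted after
    # the decoder (decoder assumed to insert mask tokens and output all N patches).
    pred = decoder(latent)[:, msk_idx]                           # (B, N_msk, patch_dim)
    recon = F.mse_loss(pred, patches[:, msk_idx])

    # The two losses supervise disjoint token partitions, so they do not interfere.
    return mimic + recon
```

The sketch only illustrates the core idea that the high-level mimic target and the low-level reconstruction target act on disjoint sets of tokens; details such as feature normalization, loss weighting, and the masked convolution stages of MR-MCMAE are omitted.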
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning
