Keywords: rescience c, python, pytorch, machine learning, deep learning, computer vision, self-supervised learning, masked autoencoder, semantic, perceptual, reproduction, replication, reproducibility, image classification, small scale
TL;DR: This is our reproducibility report for the paper "Masked Autoencoders Are Scalable Vision Learners". We also present our extension, the Semantic Masked Autoencoder (SMAE).
Abstract: Scope of Reproducibility — The Masked Autoencoder (MAE) was recently proposed as a framework for efficient self-supervised pre-training in computer vision. In this paper, we attempt a replication of the MAE under significant computational constraints. Specifically, we target the claim that masking out a large part of the input image yields a nontrivial and meaningful self-supervisory task, which allows training models that generalize well. We also present the Semantic Masked Autoencoder (SMAE), a novel yet simple extension of the MAE that uses a perceptual loss to improve encoder embeddings. Methodology — The datasets and backbones we rely on are significantly smaller than those used in the original paper. Our main experiments are performed on Tiny ImageNet (TIN), and transfer learning is performed on a low-resolution version of CUB-200-2011. We use a ViT-Lite as the backbone. We also compare the MAE to DINO, an alternative framework for self-supervised learning. The ViT, the MAE, and the perceptual loss were implemented from scratch, without consulting the original authors' code. Our code is available at https://github.com/MLReproHub/SMAE. The computational budget for our reproduction and extension was approximately 150 GPU hours. Results — This paper successfully reproduces the claim that the MAE poses a nontrivial and meaningful self-supervisory task. We show that models trained with this framework generalize well to new datasets and conclude that the MAE is reproducible, with the exception of some hyperparameter choices. We also demonstrate that the MAE performs well with smaller backbones and datasets. Finally, our results suggest that the SMAE extension improves the downstream classification accuracy of the MAE on CUB (+5 pp) when coupled with an appropriate masking strategy. What was easy — Given prior experience with a deep learning framework, re-implementing the paper was relatively straightforward, as the paper provides sufficient detail.
What was difficult — We faced challenges implementing efficient patch shuffling and tuning hyperparameters. The hyperparameter choices from the original paper did not translate well to a smaller dataset and backbone. Communication with original authors — We have not had contact with the original authors.
Paper Url: https://openaccess.thecvf.com/content/CVPR2022/html/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.html
Paper Venue: CVPR 2022
Confirmation: The report PDF is generated from the provided camera-ready Google Colab script; the report metadata is verified from the camera-ready Google Colab script; the report contains correct author information; the report contains a link to code and SWH metadata; the report follows the ReScience LaTeX style guides as in the Reproducibility Report Template (https://paperswithcode.com/rc2022/registration); the report contains the Reproducibility Summary on the first page; the LaTeX .zip file is verified from the camera-ready Google Colab script.
Journal: ReScience Volume 9 Issue 2 Article 40