[Re] Masked Autoencoders Are Small Scale Vision Learners: A Reproduction Under Resource Constraints

Athanasios Charisoudis; Simon Ekman von Huth; Emil Jansson

Published: 02 Aug 2023, Last Modified: 02 Aug 2023 | MLRC 2022 | Readers: Everyone

Keywords: rescience c, python, pytorch, machine learning, deep learning, computer vision, self-supervised learning, masked autoencoder, semantic, perceptual, reproduction, replication, reproducibility, image classification, small scale

TL;DR: This is our reproducibility report for the paper "Masked Autoencoders Are Scalable Vision Learners". We also present our extension, the Semantic Masked Autoencoder (SMAE).

Abstract:
Scope of Reproducibility: The Masked Autoencoder (MAE) was recently proposed as a framework for efficient self-supervised pre-training in computer vision [1]. In this paper, we attempt a replication of the MAE under significant computational constraints. Specifically, we target the claim that masking out a large part of the input image yields a nontrivial and meaningful self-supervisory task, which allows training models that generalize well. We also present the Semantic Masked Autoencoder (SMAE), a novel yet simple extension of the MAE that uses a perceptual loss to improve encoder embeddings.
Methodology: The datasets and backbones we rely on are significantly smaller than those used by [1]. Our main experiments are performed on Tiny ImageNet (TIN) [2], and transfer learning is performed on a low-resolution version of CUB-200-2011 [3]. We use a ViT-Lite [4] as the backbone. We also compare the MAE to DINO, an alternative framework for self-supervised learning [5]. The ViT, the MAE, and the perceptual loss were implemented from scratch, without consulting the original authors' code. Our code is available at https://github.com/MLReproHub/SMAE. The computational budget for our reproduction and extension was approximately 150 GPU hours.
Results: This paper successfully reproduces the claim that the MAE poses a nontrivial and meaningful self-supervisory task. We show that models trained with this framework generalize well to new datasets and conclude that the MAE is reproducible, with the exception of some hyperparameter choices. We also demonstrate that the MAE performs well with smaller backbones and datasets. Finally, our results suggest that the SMAE extension improves the downstream classification accuracy of the MAE on CUB (+5 pp) when coupled with an appropriate masking strategy.
What was easy: Given prior experience with a deep learning framework, re-implementing the paper was relatively straightforward, with sufficient details given in the paper.
What was difficult: We faced challenges implementing efficient patch shuffling and tuning hyperparameters. The hyperparameter choices from [1] did not translate well to a smaller dataset and backbone.
Communication with original authors: We have not had contact with the original authors.
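The patch shuffling mentioned above can be illustrated on patch indices alone. The following is a minimal sketch, not the code from the SMAE repository: the function names `random_masking` and `unshuffle`, the 64-patch grid, and the 0.75 mask ratio are all assumptions chosen for illustration. It shows how one random shuffle yields the visible set, the masked set, and the inverse permutation needed to put tokens back in image order before reconstruction.

```python
import random


def random_masking(num_patches, mask_ratio, seed=None):
    """Split patch indices into visible and masked sets with one shuffle.

    Returns (keep_ids, mask_ids, restore_ids), where restore_ids[i] is the
    position of original patch i in the shuffled order.
    Hypothetical helper for illustration only.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)

    # The first num_keep shuffled patches are fed to the encoder.
    num_keep = int(num_patches * (1.0 - mask_ratio))
    keep_ids = shuffled[:num_keep]
    mask_ids = shuffled[num_keep:]

    # Inverse permutation: where did each original patch end up?
    restore_ids = [0] * num_patches
    for position, patch in enumerate(shuffled):
        restore_ids[patch] = position
    return keep_ids, mask_ids, restore_ids


def unshuffle(tokens_in_shuffled_order, restore_ids):
    """Reorder tokens (visible first, then masked) back to image order."""
    return [tokens_in_shuffled_order[pos] for pos in restore_ids]


if __name__ == "__main__":
    # Assumed setup: 64 patches per image, 75% of them masked.
    keep, mask, restore = random_masking(64, 0.75, seed=0)
    print(len(keep), "visible patches,", len(mask), "masked patches")
    # Using the patch ids themselves as stand-in tokens recovers 0..63.
    print(unshuffle(keep + mask, restore)[:8])
```

In a tensor framework the same idea is usually applied to a whole batch at once, for example by sorting per-sample random noise and gathering along the patch dimension, which avoids Python loops; the sketch above only demonstrates the index bookkeeping.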

Paper Url: https://openaccess.thecvf.com/content/CVPR2022/html/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.html

Paper Venue: CVPR 2022

Confirmation: The report PDF is generated from the provided camera-ready Google Colab script; the report metadata is verified from the camera-ready Google Colab script; the report contains correct author information; the report contains a link to the code and SWH metadata; the report follows the ReScience LaTeX style guides as in the Reproducibility Report Template (https://paperswithcode.com/rc2022/registration); the report contains the Reproducibility Summary on the first page; the LaTeX .zip file is verified from the camera-ready Google Colab script.

Latex: zip

Journal: ReScience Volume 9 Issue 2 Article 40

Doi: https://www.doi.org/10.5281/zenodo.8173751

Code: https://archive.softwareheritage.org/swh:1:dir:4d37d466bafc5dc45bf5ba68caa53f207e6d0702
