Masked Autoencoders Are Scalable Vision Learners
Kaiming He;yXinlei ChenSaining Xie Yanghao Li Piotr Doll ´ar Ross Girshick
equal technical contributionyproject lead
Facebook AI Research (FAIR)
Abstract
This paper shows that masked autoencoders (MAE) are
scalable self-supervised learners for computer vision. Our
MAE approach is simple: we mask random patches of the
input image and reconstruct the missing pixels. It is based
on two core designs. First, we develop an asymmetric
encoder-decoder architecture, with an encoder that oper-
ates only on the visible subset of patches (without mask to-
kens), along with a lightweight decoder that reconstructs
the original image from the latent representation and mask
tokens. Second, we ﬁnd that masking a high proportion
of the input image, e.g., 75%, yields a nontrivial and
meaningful self-supervisory task. Coupling these two designs
enables us to train large models efficiently and effectively:
we accelerate training (by 3× or more) and improve
accuracy. Our scalable approach allows for learning
high-capacity models that generalize well: e.g., a vanilla
ViT-Huge model achieves the best accuracy (87.8%) among
methods that use only ImageNet-1K data. Transfer per-
formance in downstream tasks outperforms supervised pre-
training and shows promising scaling behavior.
1. Introduction
Deep learning has witnessed an explosion of archi-
tectures of continuously growing capability and capacity
[33, 25, 57]. Aided by the rapid gains in hardware, mod-
els today can easily overﬁt one million images [13] and
begin to demand hundreds of millions of—often publicly
inaccessible— labeled images [16].
This appetite for data has been successfully addressed in
natural language processing (NLP) by self-supervised pre-
training. The solutions, based on autoregressive language
modeling in GPT [47, 48, 4] and masked autoencoding in
BERT [14], are conceptually simple: they remove a portion
of the data and learn to predict the removed content. These
methods now enable training of generalizable NLP models
containing over one hundred billion parameters [4].
The idea of masked autoencoders, a form of more gen-
eral denoising autoencoders [58], is natural and applicable
in computer vision as well. Indeed, closely related research
[Figure 1 diagram labels: input, encoder, decoder, target]
Figure 1. Our MAE architecture. During pre-training, a large
random subset of image patches (e.g., 75%) is masked out. The
encoder is applied to the small subset of visible patches. Mask
tokens are introduced after the encoder, and the full set of
encoded patches and mask tokens is processed by a small decoder
that reconstructs the original image in pixels. After pre-training,
the decoder is discarded and the encoder is applied to uncorrupted
images (full sets of patches) for recognition tasks.
in vision [59, 46] preceded BERT. However, despite signif-
icant interest in this idea following the success of BERT,
progress of autoencoding methods in vision lags behind
NLP. We ask: what makes masked autoencoding different
between vision and language? We attempt to answer this
question from the following perspectives:
(i) Until recently, architectures were different. In vision,
convolutional networks [34] were dominant over the last
decade [33]. Convolutions typically operate on regular grids
and it is not straightforward to integrate ‘indicators’ such as
mask tokens [14] or positional embeddings [57] into con-
volutional networks. This architectural gap, however, has
been addressed with the introduction of Vision Transform-
ers (ViT) [16] and should no longer present an obstacle.
(ii) Information density is different between language
and vision. Languages are human-generated signals that
are highly semantic and information-dense. When training
a model to predict only a few missing words per sentence,
this task appears to induce sophisticated language under-
standing. Images, on the contrary, are natural signals with
heavy spatial redundancy— e.g., a missing patch can be re-
covered from neighboring patches with little high-level un-
arXiv:2111.06377v3 [cs.CV] 19 Dec 2021
Figure 2. Example results on ImageNet validation images. For each triplet, we show the masked image (left), our MAE reconstruction† (middle), and the ground-truth (right). The masking ratio is 80%, leaving only 39 out of 196 patches. More examples are in the appendix.
†As no loss is computed on visible patches, the model output on visible patches is qualitatively worse. One can simply overlay the output with the visible
patches to improve visual quality. We intentionally opt not to do this, so we can more comprehensively demonstrate the method’s behavior.
Figure 3. Example results on COCO validation images, using an MAE trained on ImageNet (the same model weights as in Figure 2).
Observe the reconstructions on the two right-most examples, which, although different from the ground truth, are semantically plausible.
derstanding of parts, objects, and scenes. To overcome this
difference and encourage learning useful features, we show
that a simple strategy works well in computer vision: mask-
ing a very high portion of random patches. This strategy
largely reduces redundancy and creates a challenging self-
supervisory task that requires holistic understanding beyond
low-level image statistics. To get a qualitative sense of our
reconstruction task, see Figures 2 – 4.
(iii) The autoencoder’s decoder, which maps the latent
representation back to the input, plays a different role
between reconstructing text and images. In vision, the decoder
reconstructs pixels, hence its output is of a lower semantic
level than common recognition tasks. This is in contrast
to language, where the decoder predicts missing words that
contain rich semantic information. While in BERT the de-
coder can be trivial (an MLP) [14], we found that for im-
ages, the decoder design plays a key role in determining the
semantic level of the learned latent representations.
Driven by this analysis, we present a simple, effective,
and scalable form of a masked autoencoder (MAE) for
visual representation learning. Our MAE masks random
patches from the input image and reconstructs the missing
patches in the pixel space. It has an asymmetric encoder-
decoder design. Our encoder operates only on the visible
subset of patches (without mask tokens), and our decoder is
lightweight and reconstructs the input from the latent
representation along with mask tokens (Figure 1). Shifting
the mask tokens to the small decoder in our asymmetric
encoder-decoder results in a large reduction in computation.
Under this design, a very high masking ratio (e.g., 75%) can
achieve a win-win scenario: it optimizes accuracy while
allowing the encoder to process only a small portion (e.g.,
25%) of patches. This can reduce overall pre-training time
by 3× or more and likewise reduce memory consumption,
enabling us to easily scale our MAE to large models.
Our MAE learns very high-capacity models that gen-
eralize well. With MAE pre-training, we can train data-
hungry models like ViT-Large/-Huge [16] on ImageNet-1K
with improved generalization performance. With a vanilla
ViT-Huge model, we achieve 87.8% accuracy when ﬁne-
tuned on ImageNet-1K. This outperforms all previous re-
sults that use only ImageNet-1K data. We also evaluate
transfer learning on object detection, instance segmentation,
and semantic segmentation. In these tasks, our pre-training
achieves better results than its supervised pre-training coun-
terparts, and more importantly, we observe signiﬁcant gains
by scaling up models. These observations are aligned
with those witnessed in self-supervised pre-training in NLP
[14, 47, 48, 4] and we hope that they will enable our ﬁeld to
explore a similar trajectory.
Figure 4. Reconstructions of ImageNet validation images using
an MAE pre-trained with a masking ratio of 75% but applied on
inputs with higher masking ratios. The predictions differ plausibly
from the original images, showing that the method can generalize.
(Columns: original, mask 75%, mask 85%, mask 95%.)
2. Related Work
Masked language modeling and its autoregressive coun-
terparts, e.g., BERT [14] and GPT [47, 48, 4], are highly
successful methods for pre-training in NLP. These methods
hold out a portion of the input sequence and train models
to predict the missing content. These methods have been
shown to scale excellently [4] and a large abundance of ev-
idence indicates that these pre-trained representations gen-
eralize well to various downstream tasks.
Autoencoding is a classical method for learning representa-
tions. It has an encoder that maps an input to a latent repre-
sentation and a decoder that reconstructs the input. For ex-
ample, PCA and k-means are autoencoders [29]. Denoising
autoencoders (DAE) [58] are a class of autoencoders that
corrupt an input signal and learn to reconstruct the origi-
nal, uncorrupted signal. A series of methods can be thought
of as a generalized DAE under different corruptions, e.g.,
masking pixels [59, 46, 6] or removing color channels [70].
Our MAE is a form of denoising autoencoding, but different
from the classical DAE in numerous ways.
Masked image encoding methods learn representations
from images corrupted by masking. The pioneering work
of [59] presents masking as a noise type in DAE. Context
Encoder [46] inpaints large missing regions using convolu-
tional networks. Motivated by the success in NLP, related
recent methods [6, 16, 2] are based on Transformers [57].
iGPT [6] operates on sequences of pixels and predicts un-
known pixels. The ViT paper [16] studies masked patch
prediction for self-supervised learning. Most recently, BEiT
[2] proposes to predict discrete tokens [44, 50].
Self-supervised learning approaches have seen significant
interest in computer vision, often focusing on different pre-
text tasks for pre-training [15, 61, 42, 70, 45, 17]. Re-
cently, contrastive learning [3, 22] has been popular, e.g.,
[62, 43, 23, 7], which models image similarity and dis-
similarity (or only similarity [21, 8]) between two or more
views. Contrastive and related methods strongly depend on
data augmentation [7, 21, 8]. Autoencoding pursues a con-
ceptually different direction, and it exhibits different behav-
iors as we will present.
3. Approach
Our masked autoencoder (MAE) is a simple autoencod-
ing approach that reconstructs the original signal given its
partial observation. Like all autoencoders, our approach
has an encoder that maps the observed signal to a latent
representation, and a decoder that reconstructs the origi-
nal signal from the latent representation. Unlike classical
autoencoders, we adopt an asymmetric design that allows
the encoder to operate only on the partial, observed signal
(without mask tokens) and a lightweight decoder that re-
constructs the full signal from the latent representation and
mask tokens. Figure 1 illustrates the idea, introduced next.
Masking. Following ViT [16], we divide an image into reg-
ular non-overlapping patches. Then we sample a subset of
patches and mask (i.e., remove) the remaining ones. Our
sampling strategy is straightforward: we sample random
patches without replacement, following a uniform distribu-
tion. We simply refer to this as “random sampling”.
Random sampling with a high masking ratio (i.e., the ratio
of removed patches) largely eliminates redundancy, thus
creating a task that cannot be easily solved by extrapolation
from visible neighboring patches (see Figures 2–4). The
uniform distribution prevents a potential center bias (i.e.,
more masked patches near the image center). Finally, the
highly sparse input creates an opportunity for designing an
efﬁcient encoder, introduced next.
MAE encoder. Our encoder is a ViT [16] but applied only
on visible, unmasked patches. Just as in a standard ViT, our
encoder embeds patches by a linear projection with added
positional embeddings, and then processes the resulting set
via a series of Transformer blocks. However, our encoder
only operates on a small subset (e.g., 25%) of the full set.
Masked patches are removed; no mask tokens are used.
This allows us to train very large encoders with only a frac-
tion of compute and memory. The full set is handled by a
lightweight decoder, described next.
MAE decoder. The input to the MAE decoder is the full
set of tokens consisting of (i) encoded visible patches, and
(ii) mask tokens. See Figure 1. Each mask token [14] is a
shared, learned vector that indicates the presence of a missing
patch to be predicted. We add positional embeddings to
all tokens in this full set; without this, mask tokens would
have no information about their location in the image. The
decoder has another series of Transformer blocks.
The MAE decoder is only used during pre-training to
perform the image reconstruction task (only the encoder
is used to produce image representations for recognition).
Therefore, the decoder architecture can be ﬂexibly designed
in a manner that is independent of the encoder design. We
experiment with very small decoders, narrower and shal-
lower than the encoder. For example, our default decoder
has <10% computation per token vs. the encoder. With this
asymmetrical design, the full set of tokens is only processed
by the lightweight decoder, which significantly reduces
pre-training time.
Reconstruction target. Our MAE reconstructs the input
by predicting the pixel values for each masked patch. Each
element in the decoder’s output is a vector of pixel values
representing a patch. The last layer of the decoder is a lin-
ear projection whose number of output channels equals the
number of pixel values in a patch. The decoder’s output is
reshaped to form a reconstructed image. Our loss function
computes the mean squared error (MSE) between the recon-
structed and original images in the pixel space. We compute
the loss only on masked patches, similar to BERT [14].¹
We also study a variant whose reconstruction target is
the normalized pixel values of each masked patch. Specif-
ically, we compute the mean and standard deviation of all
pixels in a patch and use them to normalize this patch. Us-
ing normalized pixels as the reconstruction target improves
representation quality in our experiments.
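The reconstruction target described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `patchify` and `masked_mse`, the channels-last image layout, and the small `eps` added for numerical stability are all choices made here for the example.

```python
import numpy as np

def patchify(images, patch_size):
    """Split (B, H, W, C) images into (B, N, patch_size*patch_size*C) flat patches."""
    b, h, w, c = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = images.reshape(b, gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (B, gh, gw, p, p, C)
    return x.reshape(b, gh * gw, patch_size * patch_size * c)

def masked_mse(pred, images, mask, patch_size, norm_pix=False, eps=1e-6):
    """Mean squared error averaged over masked patches only.

    pred: (B, N, p*p*C) decoder output, one vector of pixel values per patch.
    mask: (B, N) with 1 for masked (removed) patches and 0 for visible ones.
    """
    target = patchify(images, patch_size)
    if norm_pix:
        # Normalize each patch by the mean and standard deviation of its own pixels.
        mean = target.mean(axis=-1, keepdims=True)
        var = target.var(axis=-1, keepdims=True)
        target = (target - mean) / np.sqrt(var + eps)  # eps is an assumption
    per_patch = ((pred - target) ** 2).mean(axis=-1)   # (B, N)
    return (per_patch * mask).sum() / mask.sum()       # visible patches contribute nothing
```

Because the per-patch error is multiplied by the mask before averaging, whatever the decoder outputs at visible positions has no effect on the loss, which matches the statement that the loss is computed only on masked patches.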
Simple implementation. Our MAE pre-training can be im-
plemented efﬁciently, and importantly, does not require any
specialized sparse operations. First we generate a token for
every input patch (by linear projection with an added po-
sitional embedding). Next we randomly shufﬂe the list of
tokens and remove the last portion of the list, based on the
masking ratio. This process produces a small subset of to-
kens for the encoder and is equivalent to sampling patches
without replacement. After encoding, we append a list of
mask tokens to the list of encoded patches, and unshufﬂe
this full list (inverting the random shufﬂe operation) to align
all tokens with their targets. The decoder is applied to this
full list (with positional embeddings added). As noted, no
sparse operations are needed. This simple implementation
introduces negligible overhead as the shufﬂing and unshuf-
ﬂing operations are fast.
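The shuffle, truncate, and unshuffle steps above can be sketched as follows. This is an illustrative NumPy version for a single image, not the authors' code: the names `random_masking` and `restore_full_set` are hypothetical, the encoder is left out (any function mapping the kept tokens to encoded tokens would sit between the two calls), and the second addition of positional embeddings before the decoder is omitted.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Shuffle the token list and drop its last portion.

    tokens: (N, D) array, one row per patch token.
    Returns the kept tokens, a binary mask (1 = masked), and the indices
    needed to undo the shuffle later.
    """
    n = tokens.shape[0]
    num_keep = int(n * (1 - mask_ratio))
    ids_shuffle = rng.permutation(n)        # random order, no replacement
    ids_restore = np.argsort(ids_shuffle)   # inverse permutation
    ids_keep = ids_shuffle[:num_keep]       # first part of the shuffled list
    mask = np.ones(n, dtype=np.int64)
    mask[ids_keep] = 0
    return tokens[ids_keep], mask, ids_restore

def restore_full_set(encoded, mask_token, ids_restore):
    """Append shared mask tokens, then unshuffle to the original patch order."""
    n = ids_restore.shape[0]
    num_masked = n - encoded.shape[0]
    mask_tokens = np.tile(mask_token, (num_masked, 1))
    full = np.concatenate([encoded, mask_tokens], axis=0)  # still in shuffled order
    return full[ids_restore]
```

Only dense indexing and concatenation are involved, so the encoder sees an ordinary, shorter sequence and no sparse operations are required.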
¹Computing the loss only on masked patches differs from traditional
denoising autoencoders [58] that compute the loss on all pixels. This
choice is purely result-driven: computing the loss on all pixels leads to
a slight decrease in accuracy (e.g., 0.5%).
[Figure 5 plots. Top: fine-tuning accuracy vs. masking ratio (%), plotted values 83.2, 83.4, 83.4, 84.7, 84.9, 85.0, 84.9, 84.9, 84.5, 83.0. Bottom: linear probing accuracy vs. masking ratio (%), plotted values 54.6, 58.9, 61.7, 67.0, 69.9, 71.8, 73.2, 73.5, 71.8, 66.1. The x-axis spans masking ratios from 10% to 90%.]
Figure 5. Masking ratio. A high masking ratio (75%) works well
for both fine-tuning (top) and linear probing (bottom). The y-axes
are ImageNet-1K validation accuracy (%) in all plots in this paper.
4. ImageNet Experiments
We do self-supervised pre-training on the ImageNet-1K
(IN1K) [13] training set. Then we do supervised training to
evaluate the representations with (i) end-to-end ﬁne-tuning
or (ii) linear probing. We report top-1 validation accuracy
of a single 224224 crop. Details are in Appendix A.1.
Baseline: ViT-Large. We use ViT-Large (ViT-L/16) [16]
as the backbone in our ablation study. ViT-L is very big (an
order of magnitude bigger than ResNet-50 [25]) and tends
to overﬁt. The following is a comparison between ViT-L
trained from scratch vs. ﬁne-tuned from our baseline MAE:
scratch, original [16]   scratch, our impl.   baseline MAE
76.5                     82.5                 84.9
We note that it is nontrivial to train supervised ViT-L from
scratch and a good recipe with strong regularization is
needed (82.5%, see Appendix A.2). Even so, our MAE pre-
training contributes a big improvement. Here ﬁne-tuning is
only for 50 epochs (vs. 200 from scratch), implying that the
ﬁne-tuning accuracy heavily depends on pre-training.
4.1. Main Properties
We ablate our MAE using the default settings in Table 1
(see caption). Several intriguing properties are observed.
Masking ratio. Figure 5 shows the inﬂuence of the mask-
ing ratio. The optimal ratios are surprisingly high. The ra-
tio of 75% is good for both linear probing and ﬁne-tuning.
This behavior is in contrast with BERT [14], whose typical
masking ratio is 15%. Our masking ratios are also much
higher than those in related works [6, 16, 2] in computer
vision (20% to 50%).
The model infers missing patches to produce different,
yet plausible, outputs (Figure 4). It makes sense of the
gestalt of objects and scenes, which cannot be simply com-
pleted by extending lines or textures. We hypothesize that
this reasoning-like behavior is linked to the learning of use-
ful representations.
Figure 5 also shows that linear probing and ﬁne-tuning
results follow different trends. For linear probing, the ac-
(a) Decoder depth. A deep decoder can improve linear probing accuracy.
blocks   ft     lin
1        84.8   65.5
2        84.9   70.0
4        84.9   71.9
8        84.9   73.5
12       84.4   73.3

(b) Decoder width. The decoder can be narrower than the encoder (1024-d).
dim    ft     lin
128    84.9   69.1
256    84.8   71.3
512    84.9   73.5
768    84.4   73.1
1024   84.3   73.1

(c) Mask token. An encoder without mask tokens is more accurate and faster (Table 2).
case              ft     lin    FLOPs
encoder w/ [M]    84.2   59.6   3.3×
encoder w/o [M]   84.9   73.5   1×

(d) Reconstruction target. Pixels as reconstruction targets are effective.
case               ft     lin
pixel (w/o norm)   84.9   73.5
pixel (w/ norm)    85.4   73.9
PCA                84.6   72.3
dVAE token         85.3   71.6

(e) Data augmentation. Our MAE works with minimal or no augmentation.
case               ft     lin
none               84.0   65.7
crop, fixed size   84.7   73.1
crop, rand size    84.9   73.5
crop + color jit   84.3   71.9

(f) Mask sampling. Random sampling works the best. See Figure 6 for visualizations.
case     ratio   ft     lin
random   75      84.9   73.5
block    50      83.9   72.3
block    75      82.8   63.9
grid     75      84.0   66.0

Table 1. MAE ablation experiments with ViT-L/16 on ImageNet-1K. We report fine-tuning (ft) and linear probing (lin) accuracy (%). If
not specified, the default is: the decoder has depth 8 and width 512, the reconstruction target is unnormalized pixels, the data augmentation
is random resized cropping, the masking ratio is 75%, and the pre-training length is 800 epochs. Default settings are marked in gray.
curacy increases steadily with the masking ratio until the
sweet point: the accuracy gap is up to 20% (54.6% vs.
73.5%). For ﬁne-tuning, the results are less sensitive to the
ratios, and a wide range of masking ratios (40–80%) work
well. All ﬁne-tuning results in Figure 5 are better than train-
ing from scratch (82.5%).
Decoder design. Our MAE decoder can be ﬂexibly de-
signed, as studied in Table 1a and 1b.
Table 1a varies the decoder depth (number of Trans-
former blocks). A sufﬁciently deep decoder is important
for linear probing. This can be explained by the gap be-
tween a pixel reconstruction task and a recognition task: the
last several layers in an autoencoder are more specialized
for reconstruction, but are less relevant for recognition. A
reasonably deep decoder can account for the reconstruction
specialization, leaving the latent representations at a more
abstract level. This design can yield up to 8% improvement
in linear probing (Table 1a, ‘lin’). However, if ﬁne-tuning
is used, the last layers of the encoder can be tuned to adapt
to the recognition task. The decoder depth is less inﬂuential
for improving ﬁne-tuning (Table 1a, ‘ft’).
Interestingly, our MAE with a single-block decoder can
perform strongly with ﬁne-tuning (84.8%). Note that a sin-
gle Transformer block is the minimal requirement to propa-
gate information from visible tokens to mask tokens. Such
a small decoder can further speed up training.
In Table 1b we study the decoder width (number of chan-
nels). We use 512-d by default, which performs well un-
der ﬁne-tuning and linear probing. A narrower decoder also
works well with ﬁne-tuning.
Overall, our default MAE decoder is lightweight. It has
8 blocks and a width of 512-d (gray in Table 1). It only
has 9% FLOPs per token vs. ViT-L (24 blocks, 1024-d).
As such, while the decoder processes all tokens, it is still a
small fraction of the overall compute.
encoder         dec. depth   ft acc   hours    speedup
ViT-L, w/ [M]   8            84.2     42.4     -
ViT-L           8            84.9     15.4     2.8×
ViT-L           1            84.8     11.6     3.7×
ViT-H, w/ [M]   8            -        119.6†   -
ViT-H           8            85.8     34.5     3.5×
ViT-H           1            85.9     29.3     4.1×
Table 2. Wall-clock time of our MAE training (800 epochs),
benchmarked in 128 TPU-v3 cores with TensorFlow. The speedup
is relative to the entry whose encoder has mask tokens (gray). The
decoder width is 512, and the mask ratio is 75%. †: This entry is
estimated by training ten epochs.
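The statement that the default decoder costs about 9% of ViT-L per token can be sanity-checked with a back-of-the-envelope count. The sketch below is an assumption-laden estimate, not the authors' accounting: it counts only multiply-accumulates in the attention and MLP layers, assumes an MLP expansion ratio of 4 for both networks, assumes a sequence of 196 tokens, and ignores layer normalization, embeddings, and the prediction head.

```python
def block_macs_per_token(width, seq_len, mlp_ratio=4):
    """Approximate multiply-accumulates one Transformer block spends on one token."""
    qkv_and_proj = 4 * width * width      # query, key, value, and output projections
    mlp = 2 * mlp_ratio * width * width   # the two linear layers of the MLP
    attention = 2 * seq_len * width       # attention scores plus weighted sum of values
    return qkv_and_proj + mlp + attention

def macs_per_token(depth, width, seq_len):
    return depth * block_macs_per_token(width, seq_len)

encoder = macs_per_token(depth=24, width=1024, seq_len=196)  # ViT-L
decoder = macs_per_token(depth=8, width=512, seq_len=196)    # default MAE decoder
ratio = decoder / encoder
print(f"decoder/encoder per-token cost: {ratio:.1%}")
```

Under these assumptions the ratio lands between 8% and 10%, in line with the figure quoted in the text. The dominant term scales as depth times width squared, so a decoder with one third the depth and half the width is roughly twelve times cheaper per token.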
Mask token. An important design of our MAE is to skip
the mask token [M] in the encoder and apply it later in the
lightweight decoder. Table 1c studies this design.
If the encoder uses mask tokens, it performs worse: its
accuracy drops by 14% in linear probing. In this case,
there is a gap between pre-training and deploying: this en-
coder has a large portion of mask tokens in its input in pre-
training, which does not exist in uncorrupted images. This
gap may degrade accuracy in deployment. By removing the
mask token from the encoder, we constrain the encoder to
always see real patches and thus improve accuracy.
Moreover, by skipping the mask token in the encoder,
we greatly reduce training computation. In Table 1c, we
reduce the overall training FLOPs by 3.3×. This leads to
a 2.8× wall-clock speedup in our implementation (see Table 2).
The wall-clock speedup is even bigger (3.5–4.1×),
for a smaller decoder (1-block), a larger encoder (ViT-H),
or both. Note that the speedup can be >4× for a masking
ratio of 75%, partially because the self-attention complexity
is quadratic. In addition, memory is greatly reduced, which
can enable training even larger models or speeding up more
by large-batch training. The time and memory efﬁciency
makes our MAE favorable for training very large models.
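The FLOPs reduction discussed above can be approximated with a simple cost model. This is a rough sketch under assumptions introduced here, not the measurement behind Table 1c: it counts multiply-accumulates in attention and MLP layers only, assumes an MLP ratio of 4, uses 196 patches with 49 visible at a 75% masking ratio, and ignores the class token, embeddings, and the prediction head. The function name `total_macs` is hypothetical.

```python
def total_macs(depth, width, num_tokens, mlp_ratio=4):
    """Approximate multiply-accumulates for one pass over num_tokens tokens."""
    linear = (4 + 2 * mlp_ratio) * width * width * num_tokens  # projections and MLP
    attention = 2 * num_tokens * num_tokens * width            # quadratic in sequence length
    return depth * (linear + attention)

num_patches = 196
num_visible = int(num_patches * (1 - 0.75))  # 49 visible patches

decoder = total_macs(depth=8, width=512, num_tokens=num_patches)
encoder_full = total_macs(depth=24, width=1024, num_tokens=num_patches)
encoder_visible = total_macs(depth=24, width=1024, num_tokens=num_visible)

# Encoder with mask tokens processes all patches; without them, only the visible ones.
overall_reduction = (encoder_full + decoder) / (encoder_visible + decoder)
encoder_reduction = encoder_full / encoder_visible
print(f"overall: {overall_reduction:.2f}x, encoder only: {encoder_reduction:.2f}x")
```

In this model the encoder alone becomes slightly more than 4× cheaper, because the attention term shrinks quadratically while the linear layers shrink by exactly 4×. The decoder cost is unchanged, which pulls the overall reduction down to a little over 3×, consistent with the 3.3× reported in the text.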
Figure 6. Mask sampling strategies determine the pretext task
difficulty, influencing reconstruction quality and representations
(Table 1f). Here each output is from an MAE trained with the
specified masking strategy. Left: random sampling (our default).
Middle: block-wise sampling [2] that removes large random blocks.
Right: grid-wise sampling that keeps one of every four patches.
Images are from the validation set.
(Panel labels: block 50%, grid 75%, random 75%.)
Reconstruction target. We compare different reconstruc-
tion targets in Table 1d. Our results thus far are based on
pixels without (per-patch) normalization. Using pixels with
normalization improves accuracy. This per-patch normal-
ization enhances the contrast locally. In another variant, we
perform PCA in the patch space and use the largest PCA
coefﬁcients (96 here) as the target. Doing so degrades ac-
curacy. Both experiments suggest that the high-frequency
components are useful in our method.
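The PCA variant can be illustrated with a short NumPy sketch. The text does not say how the PCA basis was fitted, so everything here is an assumption for illustration: the basis is computed by SVD on whatever set of flattened patches is passed in, and the function name `pca_targets` is hypothetical.

```python
import numpy as np

def pca_targets(patches, num_components):
    """Project flattened patches onto their top principal directions.

    patches: (M, D) array, one flattened patch per row.
    Returns the (M, num_components) coefficients used as targets, the
    (num_components, D) basis, and the (1, D) mean that was subtracted.
    """
    mean = patches.mean(axis=0, keepdims=True)
    centered = patches - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:num_components]
    return centered @ components.T, components, mean
```

For 16×16 RGB patches each row has 768 values, so keeping 96 coefficients discards the directions with the least variance, which in natural images tend to carry the high-frequency detail.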
We also compare an MAE variant that predicts tokens,
the target used in BEiT [2]. Specifically for this variant,
we use the DALLE pre-trained dVAE [50] as the tokenizer,
following [2]. Here the MAE decoder predicts the token in-
dices using cross-entropy loss. This tokenization improves
ﬁne-tuning accuracy by 0.4% vs. unnormalized pixels, but
has no advantage vs. normalized pixels. It also reduces lin-
ear probing accuracy. In §5 we further show that tokeniza-
tion is not necessary in transfer learning.
Our pixel-based MAE is much simpler than tokenization.
The dVAE tokenizer requires one more pre-training
stage, which may depend on extra data (250M images [50]).
The dVAE encoder is a large convolutional network (40%
FLOPs of ViT-L) and adds nontrivial overhead. Using pix-
els does not suffer from these problems.
Data augmentation. Table 1e studies the inﬂuence of data
augmentation on our MAE pre-training.
Our MAE works well using cropping-only augmenta-
tion, either ﬁxed-size or random-size (both having random
horizontal ﬂipping). Adding color jittering degrades the re-
sults and so we do not use it in other experiments.
Surprisingly, our MAE behaves decently even if using
no data augmentation (only center-crop, no ﬂipping). This
property is dramatically different from contrastive learning
and related methods [62, 23, 7, 21], which heavily rely
on data augmentation. It was observed [21] that using
cropping-only augmentation reduces the accuracy by 13%
[Figure 7 plots, accuracy (%) vs. epochs (log-scale):
epochs           100    200    400    800    1600
fine-tuning      82.3   83.3   84.3   84.9   85.1
linear probing   57.3   64.4   69.7   73.5   75.1]
Figure 7. Training schedules. A longer training schedule gives a
noticeable improvement. Here each point is a full training schedule.
The model is ViT-L with the default setting in Table 1.
and 28% respectively for BYOL [21] and SimCLR [7]. In
addition, there is no evidence that contrastive learning can
work without augmentation: the two views of an image are
the same and can easily satisfy a trivial solution.
In MAE, the role of data augmentation is mainly per-
formed by random masking (ablated next). The masks are
different for each iteration and so they generate new training
samples regardless of data augmentation. The pretext task
is made difﬁcult by masking and requires less augmentation
to regularize training.
Mask sampling strategy. In Table 1f we compare different
mask sampling strategies, illustrated in Figure 6.
The block-wise masking strategy, proposed in [2], tends
to remove large blocks (Figure 6 middle). Our MAE with
block-wise masking works reasonably well at a ratio of
50%, but degrades at a ratio of 75%. This task is harder
than that of random sampling, as a higher training loss is
observed. The reconstruction is also blurrier.
We also study grid-wise sampling, which regularly keeps
one of every four patches (Figure 6 right). This is an eas-
ier task and has lower training loss. The reconstruction is
sharper. However, the representation quality is lower.
Simple random sampling works the best for our MAE. It
allows for a higher masking ratio, which provides a greater
speedup beneﬁt while also enjoying good accuracy.
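Two of the sampling strategies compared above are easy to write down as binary masks over the patch grid. The sketch below is illustrative only: the function names are hypothetical, and the choice of which patch to keep in each 2×2 cell for grid-wise sampling (the top-left one) is an assumption, since the text only says one of every four patches is kept. Block-wise masking follows the procedure of [2], which is not described here, so it is left out.

```python
import numpy as np

def random_mask(grid, mask_ratio, rng):
    """Uniform random sampling without replacement; 1 = masked, 0 = visible."""
    n = grid * grid
    num_keep = int(n * (1 - mask_ratio))
    mask = np.ones(n, dtype=np.int64)
    mask[rng.permutation(n)[:num_keep]] = 0
    return mask.reshape(grid, grid)

def grid_mask(grid):
    """Keep one patch in every 2x2 cell and mask the other three (ratio 75%)."""
    mask = np.ones((grid, grid), dtype=np.int64)
    mask[::2, ::2] = 0  # assumed: the top-left patch of each cell stays visible
    return mask

rng = np.random.default_rng(0)
print(random_mask(14, 0.75, rng).mean(), grid_mask(14).mean())
```

Both masks remove the same fraction of patches on a 14×14 grid, but in the grid-wise mask every masked patch touches a visible neighbor, which is one way to see why that task is easier and yields sharper reconstructions.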
Training schedule. Our ablations thus far are based on
800-epoch pre-training. Figure 7 shows the inﬂuence of the
training schedule length. The accuracy improves steadily
with longer training. Indeed, we have not observed sat-
uration of linear probing accuracy even at 1600 epochs.
This behavior is unlike contrastive learning methods, e.g.,
MoCo v3 [9] saturates at 300 epochs for ViT-L. Note that
the MAE encoder only sees 25% of patches per epoch,
while in contrastive learning the encoder sees 200% (two-
crop) or even more (multi-crop) patches per epoch.
method               pre-train data   ViT-B   ViT-L   ViT-H   ViT-H (448)
scratch, our impl.   -                82.3    82.6    83.1    -
DINO [5]             IN1K             82.8    -       -       -
MoCo v3 [9]          IN1K             83.2    84.1    -       -
BEiT [2]             IN1K+DALLE       83.2    85.2    -       -
MAE                  IN1K             83.6    85.9    86.9    87.8
Table 3. Comparisons with previous results on ImageNet-1K.
The pre-training data is the ImageNet-1K training set (except
the tokenizer in BEiT was pre-trained on 250M DALLE data
[50]). All self-supervised methods are evaluated by end-to-end
fine-tuning. The ViT models are B/16, L/16, H/14 [16]. The best
for each column is underlined. All results are on an image size of
224, except for ViT-H with an extra result on 448. Here our MAE
reconstructs normalized pixels and is pre-trained for 1600 epochs.
[Figure 8 plot: accuracy (%, axis from 76 to 88) vs. params (M) (axis from 0 to 600) for ViT-B/16, ViT-L/16, and ViT-H/14. Legend: MAE, IN1K; supervised, IN1K, our impl.; supervised, IN1K [16]; supervised, JFT300M [16].]
Figure 8. MAE pre-training vs. supervised pre-training, evaluated
by fine-tuning in ImageNet-1K (224 size). We compare with
the original ViT results [16] trained in IN1K or JFT300M.
4.2. Comparisons with Previous Results
Comparisons with self-supervised methods. In Table 3
we compare the ﬁne-tuning results of self-supervised ViT
models. For ViT-B, all methods perform closely. For ViT-L,
the gaps among methods are bigger, suggesting that a chal-
lenge for bigger models is to reduce overﬁtting.
Our MAE can scale up easily and has shown steady im-
provement from bigger models. We obtain 86.9% accuracy
using ViT-H (224 size). By ﬁne-tuning with a 448 size, we
achieve 87.8% accuracy, using only IN1K data . The pre-
vious best accuracy, among all methods using only IN1K
data, is 87.1% (512 size) [67], based on advanced networks.
We improve over the state-of-the-art by a nontrivial margin
in the highly competitive benchmark of IN1K (no external
data). Our result is based on vanilla ViT, and we expect
advanced networks will perform better.
Comparing with BEiT [2], our MAE is more accurate
while being simpler and faster. Our method reconstructs
pixels, in contrast to BEiT that predicts tokens: BEiT
reported a 1.8% degradation [2] when reconstructing pixels
with ViT-B.² We do not need dVAE pre-training. Moreover,
our MAE is considerably faster (3.5× per epoch) than
BEiT, for the reason studied in Table 1c.
²We observed the degradation also in BEiT with ViT-L: it produces
85.2% (tokens) and 83.5% (pixels), reproduced from the official code.
[Figure 9 plot, accuracy (%) vs. # blocks fine-tuned:
# blocks fine-tuned   0      1      2      4      6      12     18     24
MAE baseline          73.5   81.0   83.1   84.2   84.4   84.6   84.7   84.9
MoCo v3               77.6   79.9   80.8   81.6   81.9   83.2   83.8   84.1]
Figure 9. Partial fine-tuning results of ViT-L w.r.t. the number
of fine-tuned Transformer blocks under the default settings from
Table 1. Tuning 0 blocks is linear probing; 24 is full fine-tuning.
Our MAE representations are less linearly separable, but are
consistently better than MoCo v3 if one or more blocks are tuned.
The MAE models in Table 3 are pre-trained for 1600
epochs for better accuracy (Figure 7). Even so, our total
pre-training time is less than the other methods when trained
on the same hardware. For example, training ViT-L on 128
TPU-v3 cores, our MAE’s training time is 31 hours for 1600
epochs and MoCo v3’s is 36 hours for 300 epochs [9].
Comparisons with supervised pre-training. In the origi-
nal ViT paper [16], ViT-L degrades when trained in IN1K.
Our implementation of supervised training (see A.2) works
better, but accuracy saturates. See Figure 8.
Our MAE pre-training, using only IN1K, can general-
ize better: the gain over training from scratch is bigger for
higher-capacity models. It follows a trend similar to the
JFT-300M supervised pre-training in [16]. This compari-
son shows that our MAE can help scale up model sizes.
4.3. Partial Fine-tuning
Table 1 shows that linear probing and ﬁne-tuning results
are largely uncorrelated. Linear probing has been a popular
protocol in the past few years; however, it misses the oppor-
tunity of pursuing strong but non-linear features—which is
indeed a strength of deep learning. As a middle ground, we
study a partial ﬁne-tuning protocol: ﬁne-tune the last sev-
eral layers while freezing the others. This protocol was also
used in early works, e.g., [65, 70, 42].
Figure 9 shows the results. Notably, ﬁne-tuning only one
Transformer block boosts the accuracy signiﬁcantly from
73.5% to 81.0%. Moreover, if we ﬁne-tune only “half” of
the last block (i.e., its MLP sub-block), we can get 79.1%,
much better than linear probing. This variant is essentially
fine-tuning an MLP head. Fine-tuning a few blocks (e.g., 4
or 6) can achieve accuracy close to full ﬁne-tuning.
In Figure 9 we also compare with MoCo v3 [9], a con-
trastive method with ViT-L results available. MoCo v3 has
higher linear probing accuracy; however, all of its partial
ﬁne-tuning results are worse than MAE. The gap is 2.6%
when tuning 4 blocks. While the MAE representations are
less linearly separable, they are stronger non-linear features
and perform well when a non-linear head is tuned.
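The partial fine-tuning protocol can be sketched as a rule for which parameters remain trainable when only the last k of N Transformer blocks are tuned. This is a minimal illustration, not our training code: the parameter naming scheme (`blocks.<i>.`, `head.`) and the function `trainable_mask` are hypothetical, and treating non-block parameters as frozen unless all blocks are tuned is an assumption.

```python
def trainable_mask(param_names, num_blocks, k):
    """Return {name: bool} marking which parameters are fine-tuned.

    Assumptions (hypothetical naming): block parameters are named
    'blocks.<i>.<...>' with i in [0, num_blocks), and classifier
    parameters are named 'head.<...>'.
    """
    if not 0 <= k <= num_blocks:
        raise ValueError("k must be between 0 and num_blocks")
    first_tuned = num_blocks - k  # blocks with index >= first_tuned are tuned
    mask = {}
    for name in param_names:
        if name.startswith("head."):
            mask[name] = True  # the classifier is always trained
        elif name.startswith("blocks."):
            mask[name] = int(name.split(".")[1]) >= first_tuned
        else:
            # e.g. patch embedding: frozen unless every block is tuned
            mask[name] = (k == num_blocks)
    return mask


# ViT-L has 24 blocks; k=0 trains only the classifier, k=24 trains everything.
names = (["patch_embed.proj.weight"]
         + [f"blocks.{i}.mlp.fc1.weight" for i in range(24)]
         + ["head.weight"])
mask = trainable_mask(names, num_blocks=24, k=4)
print(sorted(n for n, tuned in mask.items() if tuned))
```

In a real framework the resulting mask would be applied by disabling gradients for every parameter whose entry is False.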
                                   AP^box          AP^mask
method      pre-train data     ViT-B  ViT-L    ViT-B  ViT-L
supervised  IN1K w/ labels     47.9   49.3     42.9   43.9
MoCo v3     IN1K               47.9   49.3     42.7   44.0
BEiT        IN1K+DALLE         49.8   53.3     44.4   47.1
MAE         IN1K               50.3   53.3     44.9   47.2
Table 4. COCO object detection and segmentation using a ViT
Mask R-CNN baseline. All entries are based on our implementa-
tion. Self-supervised entries use IN1K data without labels. Mask
AP follows a similar trend as box AP.
These observations suggest that linear separability is not
the sole metric for evaluating representation quality. It has
also been observed (e.g., [8]) that linear probing is not well
correlated with transfer learning performance, e.g., for ob-
ject detection. To our knowledge, linear evaluation is not
often used in NLP for benchmarking pre-training.
5. Transfer Learning Experiments
We evaluate transfer learning in downstream tasks using
the pre-trained models in Table 3.
Object detection and segmentation. We ﬁne-tune Mask
R-CNN [24] end-to-end on COCO [37]. The ViT backbone
is adapted for use with FPN [36] (see A.3). We apply this
approach for all entries in Table 4. We report box AP for
object detection and mask AP for instance segmentation.
Compared to supervised pre-training, our MAE performs
better under all conﬁgurations (Table 4). With the smaller
ViT-B, our MAE is 2.4 points higher than supervised pre-training
(50.3 vs. 47.9, AP^box). More significantly, with the
larger ViT-L, our MAE pre-training outperforms supervised
pre-training by 4.0 points (53.3 vs. 49.3).
The pixel-based MAE is better than or on par with the
token-based BEiT, while MAE is much simpler and faster.
Both MAE and BEiT are better than MoCo v3, and MoCo
v3 is on par with supervised pre-training.
Semantic segmentation. We experiment on ADE20K [72]
using UperNet [63] (see A.4). Table 5 shows that our pre-
training signiﬁcantly improves results over supervised pre-
training, e.g., by 3.7 points for ViT-L. Our pixel-based MAE
also outperforms the token-based BEiT. These observations
are consistent with those in COCO.
Classiﬁcation tasks. Table 6 studies transfer learning on
the iNaturalist [56] and Places [71] tasks (see A.5). On
iNat, our method shows strong scaling behavior: accuracy
improves considerably with bigger models. Our results surpass
the previous best results by large margins. On Places,
our MAE outperforms the previous best results [19, 40],
which were obtained via pre-training on billions of images.
Pixels vs. tokens. Table 7 compares pixels vs. tokens as the
MAE reconstruction target. While using dVAE tokens is
better than using unnormalized pixels, it is statistically similar
to using normalized pixels across all cases we tested. It
again shows that tokenization is not necessary for our MAE.

method      pre-train data     ViT-B  ViT-L
supervised  IN1K w/ labels     47.4   49.9
MoCo v3     IN1K               47.3   49.1
BEiT        IN1K+DALLE         47.1   53.3
MAE         IN1K               48.1   53.6
Table 5. ADE20K semantic segmentation (mIoU) using Uper-
Net. BEiT results are reproduced using the ofﬁcial code. Other
entries are based on our implementation. Self-supervised entries
use IN1K data without labels.
dataset     ViT-B  ViT-L  ViT-H  ViT-H_448  prev best
iNat 2017   70.5   75.7   79.3   83.4       75.4 [55]
iNat 2018   75.4   80.1   83.0   86.8       81.2 [54]
iNat 2019   80.5   83.4   85.7   88.3       84.1 [54]
Places205   63.9   65.8   65.9   66.8       66.0 [19]†
Places365   57.9   59.4   59.8   60.3       58.0 [40]‡
Table 6. Transfer learning accuracy on classification datasets,
using MAE pre-trained on IN1K and then fine-tuned. We provide
system-level comparisons with the previous best results.
†: pre-trained on 1 billion images. ‡: pre-trained on 3.5 billion images.
                    IN1K                   COCO           ADE20K
                    ViT-B  ViT-L  ViT-H    ViT-B  ViT-L   ViT-B  ViT-L
pixel (w/o norm)    83.3   85.1   86.2     49.5   52.8    48.0   51.8
pixel (w/ norm)     83.6   85.9   86.9     50.3   53.3    48.1   53.6
dVAE token          83.6   85.7   86.9     50.3   53.2    48.1   53.4
△                   0.0    -0.2   0.0      0.0    -0.1    0.0    -0.2
Table 7. Pixels vs. tokens as the MAE reconstruction target. △ is
the difference between using dVAE tokens and using normalized
pixels. The difference is statistically insignificant.
6. Discussion and Conclusion
Simple algorithms that scale well are the core of deep
learning. In NLP, simple self-supervised learning methods
(e.g., [47, 14, 48, 4]) enable beneﬁts from exponentially
scaling models. In computer vision, practical pre-training
paradigms are dominantly supervised (e.g., [33, 51, 25, 16])
despite progress in self-supervised learning. In this study,
we observe on ImageNet and in transfer learning that
an autoencoder—a simple self-supervised method similar
to techniques in NLP—provides scalable beneﬁts. Self-
supervised learning in vision may now be embarking on a
similar trajectory as in NLP.
On the other hand, we note that images and languages
are signals of a different nature and this difference must
be addressed carefully. Images are merely recorded light
without a semantic decomposition into the visual analogue
of words. Instead of attempting to remove objects, we remove
random patches that most likely do not form a semantic
segment. Likewise, our MAE reconstructs pixels, which
are not semantic entities. Nevertheless, we observe (e.g.,
Figure 4) that our MAE infers complex, holistic reconstruc-
tions, suggesting it has learned numerous visual concepts,
i.e., semantics. We hypothesize that this behavior occurs
by way of a rich hidden representation inside the MAE. We
hope this perspective will inspire future work.
Broader impacts. The proposed method predicts content
based on learned statistics of the training dataset and as such
will reﬂect biases in those data, including ones with nega-
tive societal impacts. The model may generate inexistent
content. These issues warrant further research and consid-
eration when building upon this work to generate images.
References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer
normalization. arXiv:1607.06450 , 2016.
[2] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training
of image transformers. arXiv:2106.08254, 2021. Accessed in June 2021.
[3] Suzanna Becker and Geoffrey E Hinton. Self-organizing neural
network that discovers surfaces in random-dot stereograms. Na-
ture, 1992.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah,
Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav
Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris
Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Ben-
jamin Chess, Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei. Language mod-
els are few-shot learners. In NeurIPS , 2020.
[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien
Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties
in self-supervised vision transformers. In ICCV , 2021.
[6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun,
David Luan, and Ilya Sutskever. Generative pretraining from pix-
els. In ICML , 2020.
[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey
Hinton. A simple framework for contrastive learning of visual rep-
resentations. In ICML , 2020.
[8] Xinlei Chen and Kaiming He. Exploring simple Siamese represen-
tation learning. In CVPR , 2021.
[9] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of
training self-supervised Vision Transformers. In ICCV , 2021.
[10] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D
Manning. ELECTRA: Pre-training text encoders as discriminators
rather than generators. In ICLR , 2020.
[11] Corinna Cortes and Vladimir Vapnik. Support-vector networks.
Machine learning , 1995.
[12] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Ran-
daugment: Practical automated data augmentation with a reduced
search space. In CVPR Workshops , 2020.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li
Fei-Fei. ImageNet: A large-scale hierarchical image database. In
CVPR , 2009.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. BERT: Pre-training of deep bidirectional transformers
for language understanding. In NAACL , 2019.
[15] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised
visual representation learning by context prediction. In ICCV ,
2015.
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk
Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani,
Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob
Uszkoreit, and Neil Houlsby. An image is worth 16x16 words:
Transformers for image recognition at scale. In ICLR , 2021.
[17] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsuper-
vised representation learning by predicting image rotations. In
ICLR , 2018.
[18] Xavier Glorot and Yoshua Bengio. Understanding the difﬁculty of
training deep feedforward neural networks. In AISTATS , 2010.
[19] Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu,
Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Is-
han Misra, Armand Joulin, and Piotr Bojanowski. Self-supervised
pretraining of visual features in the wild. arXiv:2103.01988 , 2021.
[20] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz
Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and
Kaiming He. Accurate, large minibatch SGD: Training ImageNet
in 1 hour. arXiv:1706.02677 , 2017.
[21] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec,
Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo
Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal
Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Boot-
strap your own latent - a new approach to self-supervised learning.
In NeurIPS, 2020.
[22] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality
reduction by learning an invariant mapping. In CVPR , 2006.
[23] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Gir-
shick. Momentum contrast for unsupervised visual representation
learning. In CVPR , 2020.
[24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick.
Mask R-CNN. In ICCV , 2017.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. In CVPR , 2016.
[26] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath,
Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak
Parajuli, Mike Guo, et al. The many faces of robustness: A critical
analysis of out-of-distribution generalization. In ICCV , 2021.
[27] Dan Hendrycks and Thomas Dietterich. Benchmarking neural net-
work robustness to common corruptions and perturbations. In
ICLR , 2019.
[28] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and
Dawn Song. Natural adversarial examples. In CVPR , 2021.
[29] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum
description length, and helmholtz free energy. In NeurIPS , 1994.
[30] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Wein-
berger. Deep networks with stochastic depth. In ECCV , 2016.
[31] Sergey Ioffe and Christian Szegedy. Batch normalization: Accel-
erating deep network training by reducing internal covariate shift.
In ICML, 2015.
[32] Insoo Kim, Seungju Han, Ji-won Baek, Seong-Jin Park, Jae-Joon
Han, and Jinwoo Shin. Quality-agnostic image recognition via in-
vertible decoder. In CVPR , 2021.
[33] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet clas-
siﬁcation with deep convolutional neural networks. In NeurIPS ,
2012.
[34] Yann LeCun, Bernhard Boser, John S Denker, Donnie Hender-
son, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel.
Backpropagation applied to handwritten zip code recognition. Neu-
ral computation , 1989.
[35] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollár, Kaiming He,
and Ross Girshick. Benchmarking detection transfer learning with
vision transformers. In preparation , 2021.
[36] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath
Hariharan, and Serge Belongie. Feature pyramid networks for ob-
ject detection. In CVPR , 2017.
[37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro
Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick.
Microsoft COCO: Common objects in context. In ECCV, 2014.
[38] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient de-
scent with warm restarts. In ICLR , 2017.
[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regu-
larization. In ICLR , 2019.
[40] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming
He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens
van der Maaten. Exploring the limits of weakly supervised pre-
training. In ECCV , 2018.
[41] Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan,
Shaokai Ye, Yuan He, and Hui Xue. Towards robust vision trans-
former. arXiv:2105.07926 , 2021.
[42] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual
representations by solving jigsaw puzzles. In ECCV , 2016.
[43] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representa-
tion learning with contrastive predictive coding. arXiv:1807.03748 ,
2018.
[44] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neu-
ral discrete representation learning. In NeurIPS , 2017.
[45] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and
Bharath Hariharan. Learning features by watching objects move.
In CVPR, 2017.
[46] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell,
and Alexei A Efros. Context encoders: Feature learning by inpaint-
ing. In CVPR , 2016.
[47] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya
Sutskever. Improving language understanding by generative pre-
training. 2018.
[48] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario
Amodei, and Ilya Sutskever. Language models are unsupervised
multitask learners. 2019.
[49] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan
Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
Exploring the limits of transfer learning with a uniﬁed text-to-text
transformer. JMLR , 2020.
[50] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea
Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot
text-to-image generation. In ICML , 2021.
[51] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. In ICLR , 2015.
[52] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon
Shlens, and Zbigniew Wojna. Rethinking the inception architec-
ture for computer vision. In CVPR , 2016.
[53] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa,
Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient
image transformers & distillation through attention. In ICML ,
2021.
[54] Hugo Touvron, Alexandre Sablayrolles, Matthijs Douze, Matthieu
Cord, and Hervé Jégou. Grafit: Learning fine-grained image representations
with coarse labels. In ICCV, 2021.
[55] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou.
Fixing the train-test resolution discrepancy. arXiv:1906.06423,
2019.
[56] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen
Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie.
The iNaturalist species classification and detection dataset.
In CVPR, 2018.
[57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin.
Attention is all you need. In NeurIPS , 2017.
[58] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-
Antoine Manzagol. Extracting and composing robust features with
denoising autoencoders. In ICML , 2008.
[59] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio,
Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders:
Learning useful representations in a deep network with
a local denoising criterion. JMLR, 2010.
[60] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing.
Learning robust global representations by penalizing local predic-
tive power. In NeurIPS , 2019.
[61] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of vi-
sual representations using videos. In ICCV , 2015.
[62] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsuper-
vised feature learning via non-parametric instance discrimination.
In CVPR, 2018.
[63] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian
Sun. Uniﬁed perceptual parsing for scene understanding. In ECCV ,
2018.
[64] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár,
and Ross Girshick. Early convolutions help transformers see better.
In NeurIPS, 2021.
[65] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How
transferable are features in deep neural networks? In NeurIPS ,
2014.
[66] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training
of convolutional networks. arXiv:1708.03888 , 2017.
[67] Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and Shuicheng Yan.
VOLO: Vision outlooker for visual recognition. arXiv:2106.13112 ,
2021.
[68] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun,
Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy
to train strong classiﬁers with localizable features. In ICCV , 2019.
[69] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David
Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR ,
2018.
[70] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image
colorization. In ECCV , 2016.
[71] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba,
and Aude Oliva. Learning deep features for scene recognition using
Places database. In NeurIPS , 2014.
[72] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler,
Adela Barriuso, and Antonio Torralba. Semantic understanding
of scenes through the ADE20K dataset. IJCV , 2019.
A. Implementation Details
A.1. ImageNet Experiments
ViT architecture. We follow the standard ViT architecture
[16]. It has a stack of Transformer blocks [57], and each
block consists of a multi-head self-attention block and an
MLP block, both having LayerNorm (LN) [1]. The encoder
ends with LN. As the MAE encoder and decoder have different
widths, we adopt a linear projection layer after the
encoder to match them. Our MAE adds positional embeddings
[57] (the sine-cosine version) to both the encoder and decoder
inputs. Our MAE does not use relative position or
layer scaling (which are used in the code of [2]).
We extract features from the encoder output for ﬁne-
tuning and linear probing. As ViT has a class token [16],
to adapt to this design, in our MAE pre-training we append
an auxiliary dummy token to the encoder input. This token
will be treated as the class token for training the classiﬁer in
linear probing and ﬁne-tuning. Our MAE works similarly
well without this token (with average pooling).
Pre-training. The default setting is in Table 8. We do
not use color jittering, drop path, or gradient clipping. We use
xavier_uniform [18] to initialize all Transformer blocks, following
ViT's official code [16]. We use the linear lr scaling
rule [20]: lr = base_lr × batchsize / 256.
End-to-end ﬁne-tuning. Our ﬁne-tuning follows common
practice of supervised ViT training. The default setting is in
Table 9. We use layer-wise lr decay [10] following [2].
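Layer-wise lr decay assigns smaller learning rates to layers closer to the input. The sketch below shows one common formulation, in which the multiplier shrinks geometrically with distance from the classifier. The indexing convention (index 0 for the embeddings, the last index for the classifier) is an assumption made for illustration and may differ in detail from the code of [2] and [10].

```python
def layerwise_lr_scales(num_blocks, decay):
    """Per-layer lr multipliers under geometric layer-wise decay.

    Assumed indexing: 0 = patch/position embeddings,
    1..num_blocks = Transformer blocks, num_blocks + 1 = classifier.
    The classifier gets multiplier 1.0; each step toward the input
    multiplies the learning rate by `decay`.
    """
    num_layers = num_blocks + 1
    return [decay ** (num_layers - i) for i in range(num_layers + 1)]


# ViT-L has 24 blocks; Table 9 uses a decay of 0.75.
scales = layerwise_lr_scales(num_blocks=24, decay=0.75)
print(scales[-1], scales[-2], scales[0])
```

Each layer's learning rate is then the scaled base learning rate times its multiplier, so the earliest layers move much more slowly than the classifier.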
Linear probing. Our linear classiﬁer training follows [9].
See Table 10. We observe that linear probing requires a very
different recipe than end-to-end ﬁne-tuning. In particular,
regularization is in general harmful for linear probing. Following
[9], we disable many common regularization strategies:
we do not use mixup [69], cutmix [68], drop path [30],
or color jittering, and we set weight decay to zero.
It is a common practice to normalize the classifier input
when training a classical linear classifier (e.g., SVM [11]).
Similarly, it is beneficial to normalize the pre-trained features
when training the linear probing classifier. Following
[15], we adopt an extra BatchNorm layer [31] without
affine transformation (affine=False). This layer is applied
to the pre-trained features produced by the encoder,
and is placed before the linear classifier. We note that the layer
does not break the linear property, and it can be absorbed
into the linear classifier after training: it is essentially a
re-parameterized linear classifier.³ Introducing this layer helps
calibrate the feature magnitudes across different variants in
our ablations, so that they can use the same setting without
further lr search.
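To make the absorption argument concrete, the sketch below folds a non-affine BatchNorm (with fixed statistics, as at inference time) into the weights and bias of the linear classifier and checks numerically that both forms give the same output. All numbers are made up for illustration, the helper names are ours, and the value of `eps` is an assumption.

```python
import math


def absorb_batchnorm(W, b, mean, var, eps=1e-5):
    """Fold x_hat = (x - mean) / sqrt(var + eps) into y = W x_hat + b.

    Returns (W_folded, b_folded) such that W_folded x + b_folded
    equals W x_hat + b for every input x.
    """
    inv_std = [1.0 / math.sqrt(v + eps) for v in var]
    W_folded = [[w * s for w, s in zip(row, inv_std)] for row in W]
    b_folded = [bi - sum(w * m for w, m in zip(row, mean))
                for row, bi in zip(W_folded, b)]
    return W_folded, b_folded


def linear(W, b, x):
    """Plain linear map y = W x + b on Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]


# Toy 3-dimensional features and a 2-class classifier (made-up values).
mean, var = [0.5, -1.0, 2.0], [4.0, 0.25, 1.0]
W, b = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0]], [0.1, -0.2]
x = [1.5, 0.0, -2.0]

x_hat = [(xi - m) / math.sqrt(v + 1e-5) for xi, m, v in zip(x, mean, var)]
y_bn = linear(W, b, x_hat)                      # BatchNorm, then classifier
W_f, b_f = absorb_batchnorm(W, b, mean, var)
y_folded = linear(W_f, b_f, x)                  # single folded classifier
print(y_bn, y_folded)
```

The two outputs agree up to floating-point error, which is the sense in which the extra layer only re-parameterizes the linear classifier.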
³ Alternatively, we can pre-compute the mean and std of the features
and use the normalized features to train linear classifiers.

config                   value
optimizer                AdamW [39]
base learning rate       1.5e-4
weight decay             0.05
optimizer momentum       β1, β2 = 0.9, 0.95 [6]
batch size               4096
learning rate schedule   cosine decay [38]
warmup epochs [20]       40
augmentation             RandomResizedCrop
Table 8. Pre-training setting.
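The learning rate schedule implied by Table 8 (warmup followed by cosine decay) can be sketched as below. This is an illustration under stated assumptions rather than our exact implementation: the warmup is taken to be linear from zero, the schedule is evaluated per epoch, and the floor `min_lr` is set to zero. The peak value of 2.4e-3 follows from the linear scaling rule with the Table 8 settings, and 1600 epochs matches the models in Table 3.

```python
import math


def lr_at_epoch(epoch, peak_lr, warmup_epochs, total_epochs, min_lr=0.0):
    """Linear warmup to peak_lr, then half-cycle cosine decay to min_lr."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


peak = 1.5e-4 * 4096 / 256  # about 2.4e-3
for epoch in (0, 20, 40, 820, 1600):
    print(epoch, lr_at_epoch(epoch, peak, warmup_epochs=40, total_epochs=1600))
```

The rate rises linearly over the first 40 epochs, peaks at epoch 40, passes through half the peak midway through the decay phase, and reaches the floor at the final epoch.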
config                       value
optimizer                    AdamW
base learning rate           1e-3
weight decay                 0.05
optimizer momentum           β1, β2 = 0.9, 0.999
layer-wise lr decay [10, 2]  0.75
batch size                   1024
learning rate schedule       cosine decay
warmup epochs                5
training epochs              100 (B), 50 (L/H)
augmentation                 RandAug (9, 0.5) [12]
label smoothing [52]         0.1
mixup [69]                   0.8
cutmix [68]                  1.0
drop path [30]               0.1 (B/L) 0.2 (H)
Table 9. End-to-end ﬁne-tuning setting.
config                   value
optimizer                LARS [66]
base learning rate       0.1
weight decay             0
optimizer momentum       0.9
batch size               16384
learning rate schedule   cosine decay
warmup epochs            10
training epochs          90
augmentation             RandomResizedCrop
Table 10. Linear probing setting. We use LARS with a large
batch for faster training; SGD works similarly with a 4096 batch.
Partial ﬁne-tuning. Our MAE partial ﬁne-tuning (§4.3)
follows the setting in Table 9, except that we adjust the num-
ber of ﬁne-tuning epochs. We observe that tuning fewer
blocks requires a longer schedule. We set the numbers of
fine-tuning epochs as {50, 100, 200} and use the optimal
one for each number of blocks tuned.
A.2. Supervised Training ViT-L/H from Scratch
We ﬁnd that it is nontrivial to train supervised ViT-L/H
from scratch on ImageNet-1K. The training is unstable.
While there have been strong baselines with publicly avail-
able implementations [53] for smaller models, the recipes
for the larger ViT-L/H are unexplored. Directly applying
the previous recipes to these larger models does not work.
A NaN loss is frequently observed during training.
We provide our recipe in Table 11. We use a wd of 0.3,
a large batch size of 4096, and a long warmup, following
the original ViT [16]. We use β2 = 0.95 following [6]. We
use the regularizations listed in Table 11 and disable others,
following [64]. All these choices are for improving training
stability. Our recipe can ﬁnish training with no NaN loss.
config                       value
optimizer                    AdamW
base learning rate           1e-4
weight decay                 0.3
optimizer momentum           β1, β2 = 0.9, 0.95
batch size                   4096
learning rate schedule       cosine decay
warmup epochs                20
training epochs              300 (B), 200 (L/H)
augmentation                 RandAug (9, 0.5) [12]
label smoothing [52]         0.1
mixup [69]                   0.8
cutmix [68]                  1.0
drop path [30]               0.1 (B), 0.2 (L/H)
exp. moving average (EMA)    0.9999
Table 11. Supervised training ViT from scratch.
The accuracy is 82.6% for ViT-L (81.5% w/o EMA), and
83.1% for ViT-H (80.9% w/o EMA). Both ViT-L and ViT-H
show an overﬁtting trend if not using EMA.
As a by-product, our recipe for ViT-B has 82.3% accu-
racy (82.1% w/o EMA), vs. 81.8% in [53].
A.3. Object Detection and Segmentation in COCO
We adapt the vanilla ViT for the use of an FPN backbone
[36] in Mask R-CNN [24]. ViT has a stack of Transformer
blocks that all produce feature maps at a single scale (e.g.,
stride 16). We equally divide this stack into 4 subsets and
apply convolutions to upsample or downsample the inter-
mediate feature maps for producing different scales (stride
4, 8, 16, or 32, the same as a standard ResNet [25]). FPN is
built on these multi-scale maps.
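The division of the ViT stack for FPN can be sketched as follows. The sketch only computes, for each of the 4 subsets, which block's output is tapped and by what factor its stride-16 map must be rescaled. Tapping the last block of each subset, and pairing earlier subsets with finer strides, are assumptions made for illustration; the function name `fpn_taps` is hypothetical and the actual convolutional layers are described in [35].

```python
def fpn_taps(num_blocks, patch_stride=16, target_strides=(4, 8, 16, 32)):
    """Return (block_index, target_stride, scale_factor) per FPN level.

    scale_factor > 1 means the stride-16 map is upsampled,
    scale_factor < 1 means it is downsampled.
    """
    num_levels = len(target_strides)
    if num_blocks % num_levels != 0:
        raise ValueError("num_blocks must be divisible by the number of levels")
    per_subset = num_blocks // num_levels
    taps = []
    for level, stride in enumerate(target_strides):
        last_block = (level + 1) * per_subset - 1  # 0-based, assumed tap point
        taps.append((last_block, stride, patch_stride / stride))
    return taps


# ViT-L (24 blocks): four subsets of 6 blocks each.
for block, stride, scale in fpn_taps(24):
    print(f"block {block:2d} -> stride {stride:2d} (rescale x{scale})")
```

Under these assumptions, the stride-16 map from the third subset is used as is, the first two are upsampled by 4x and 2x, and the last is downsampled by 2x.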
For fair comparisons among different methods, we
search for hyper-parameters for each entry in Table 4 (in-
cluding all competitors). The hyper-parameters we search
for are the learning rate, weight decay, drop path rate, and
ﬁne-tuning epochs. We will release code along with the
speciﬁc conﬁgurations. For full model and training details,
plus additional experiments, see [35].
A.4. Semantic Segmentation in ADE20K
We use UperNet [63] following the semantic segmenta-
tion code of [2]. We ﬁne-tune end-to-end for 100 epochs
with a batch size of 16. We search for the optimal lr for
each entry in Table 5 (including all competitors).
The semantic segmentation code of [2] uses relative position
bias [49]. Our MAE pre-training does not use it. For
fair comparison, we turn on relative position bias only dur-
ing transfer learning, initialized as zero. We note that our
BEiT reproduction uses relative position bias in both pre-
training and ﬁne-tuning, following their code.
A.5. Additional Classiﬁcation Tasks
We follow the setting in Table 9 for iNaturalist and
Places fine-tuning (Table 6). We adjust the lr and fine-tuning
epochs for each individual dataset.

method     model    params   acc
iGPT [6]   iGPT-L   1362 M   69.0
iGPT [6]   iGPT-XL  6801 M   72.0
BEiT [2]   ViT-L    304 M    52.1†
MAE        ViT-B    86 M     68.0
MAE        ViT-L    304 M    75.8
MAE        ViT-H    632 M    76.6
Table 12. Linear probing results of masked encoding methods.
Our fine-tuning results are in Table 3. †: our implementation.
dataset                ViT-B  ViT-L  ViT-H  ViT-H_448  prev best
IN-Corruption ↓ [27]   51.7   41.8   33.8   36.8       42.5 [32]
IN-Adversarial [28]    35.9   57.1   68.2   76.7       35.8 [41]
IN-Rendition [26]      48.3   59.9   64.4   66.5       48.7 [41]
IN-Sketch [60]         34.5   45.3   49.6   50.9       36.0 [41]
our supervised training baselines:
IN-Corruption ↓        45.8   42.3   41.3
IN-Adversarial         27.2   29.6   33.1
IN-Rendition           49.4   50.9   50.3
IN-Sketch              35.6   37.5   38.0
Table 13. Robustness evaluation on ImageNet variants (top-1
accuracy, except for IN-C [27] which evaluates mean corruption
error). We test the same MAE models (Table 3) on different Im-
ageNet validation sets, without any specialized ﬁne-tuning. We
provide system-level comparisons with the previous best results.
B. Comparison on Linear Probing Results
In §4.3 we have shown that linear probing accuracy and
fine-tuning accuracy are largely uncorrelated: the two protocols
differ in how much they depend on linear separability. We notice that
existing masked image encoding methods are generally less
competitive in linear probing (e.g., than contrastive learning).
For completeness, in Table 12 we compare on linear
probing accuracy with masking-based methods.
Our MAE with ViT-L has 75.8% linear probing accu-
racy. This is substantially better than previous masking-
based methods. On the other hand, it still lags behind con-
trastive methods under this protocol: e.g., MoCo v3 [9] has
77.6% linear probing accuracy for the ViT-L (Figure 9).
C. Robustness Evaluation on ImageNet
In Table 13 we evaluate the robustness of our models on
different variants of ImageNet validation sets. We use the
same models ﬁne-tuned on original ImageNet (Table 3) and
only run inference on the different validation sets, without
any specialized ﬁne-tuning. Table 13 shows that our method
has strong scaling behavior: increasing the model sizes has
signiﬁcant gains. Increasing the image size helps in all sets
but IN-C. Our results outperform the previous best results
(of specialized systems) by large margins.
In contrast, supervised training performs much worse
(Table 13 bottom; models described in A.2). For example,
with ViT-H, our MAE pre-training is about 35 points better on IN-A
(68.2% vs. 33.1%) than the supervised counterpart.
Figure 10. Uncurated random samples on ImageNet validation images. For each triplet, we show the masked image (left), our MAE
reconstruction (middle), and the ground-truth (right). The masking ratio is 75%.
Figure 11. Uncurated random samples on COCO validation images, using an MAE trained on ImageNet. For each triplet, we show the
masked image (left), our MAE reconstruction (middle), and the ground-truth (right). The masking ratio is 75%.