Keywords: self-supervised learning, visual pre-training, representation learning
TL;DR: Learning visual representations doesn't require the model to generate self-consistent images.
Abstract: In this work, we present cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs, yet it achieves performance comparable or superior to traditional MAE across models ranging from ViT-S to ViT-H. CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code is available [here](https://anonymous.4open.science/r/mae-cross-anon-11EB/README.md).
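To illustrate the core idea, the following is a minimal sketch (not the authors' implementation) of a cross-attention readout in NumPy: mask-token queries attend only to encoder outputs of visible patches, never to each other, so each masked patch's reconstruction can be read out independently. All names (`cross_attention_readout`, the weight matrices `Wq`, `Wk`, `Wv`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_readout(queries, enc_out, Wq, Wk, Wv):
    """Single-head cross-attention sketch.

    queries: (n_masked, d)  mask-token queries for the patches to decode
    enc_out: (n_visible, d) encoder outputs for visible patches

    Each query attends only to enc_out (no self-attention among mask
    tokens), so row i of the output depends only on queries[i] -- a
    small subset of masked patches can be decoded independently.
    """
    q = queries @ Wq                                  # (n_masked, d)
    k = enc_out @ Wk                                  # (n_visible, d)
    v = enc_out @ Wv                                  # (n_visible, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (n_masked, n_visible)
    return attn @ v                                   # (n_masked, d)
```

Because there is no interaction between mask tokens, decoding only a subset of queries yields exactly the same outputs for those patches as decoding all of them, which is what lets CrossMAE reconstruct only a small subset of masked patches during pretraining.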
Submission Number: 83