Rethinking Patch Dependence for Masked Autoencoders

Published: 13 Oct 2024, Last Modified: 02 Dec 2024 · NeurIPS 2024 Workshop SSL · CC BY 4.0
Keywords: self-supervised learning, visual pre-training, representation learning
TL;DR: Learning visual representations doesn't require the model to generate self-consistent images.
Abstract: In this work, we present cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs, yet it achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H. CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code is available [here](https://anonymous.4open.science/r/mae-cross-anon-11EB/README.md).
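The architectural change described in the abstract is simple to sketch: mask-token queries (one per target patch, plus positional embeddings) cross-attend to the encoder outputs of the visible patches, with no self-attention among the mask tokens, so each masked patch is reconstructed independently. Below is a minimal PyTorch sketch of this idea, assuming illustrative module names, depth, and dimensions; it is not the authors' implementation (see the linked code for that).

```python
# Minimal sketch of a cross-attention-only decoder in the spirit of CrossMAE.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class CrossAttentionDecoderBlock(nn.Module):
    """Mask-token queries cross-attend to encoder outputs.

    There is no self-attention among the mask tokens, so reconstructions
    for different masked patches do not interact.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        q = self.norm_q(queries)
        kv = self.norm_kv(context)
        attn_out, _ = self.cross_attn(q, kv, kv)
        x = queries + attn_out
        return x + self.mlp(self.norm_mlp(x))


class CrossDecoder(nn.Module):
    """Reconstructs only a sampled subset of masked patches from encoder outputs."""

    def __init__(self, dim: int, patch_dim: int, num_patches: int, depth: int = 2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.blocks = nn.ModuleList(CrossAttentionDecoderBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, patch_dim)  # predicts pixel values per patch

    def forward(self, enc_out: torch.Tensor, target_idx: torch.Tensor) -> torch.Tensor:
        # enc_out: (B, N_visible, dim) encoder outputs for visible patches
        # target_idx: (B, N_target) indices of the masked patches to reconstruct
        B, n_target = target_idx.shape
        pos = torch.gather(
            self.pos_embed.expand(B, -1, -1), 1,
            target_idx.unsqueeze(-1).expand(-1, -1, self.pos_embed.shape[-1]),
        )
        queries = self.mask_token.expand(B, n_target, -1) + pos
        for blk in self.blocks:
            queries = blk(queries, enc_out)
        return self.head(queries)  # (B, N_target, patch_dim)


if __name__ == "__main__":
    dec = CrossDecoder(dim=256, patch_dim=16 * 16 * 3, num_patches=196)
    enc_out = torch.randn(2, 49, 256)            # encoder outputs for 25% visible patches
    target_idx = torch.randint(0, 196, (2, 36))  # reconstruct only a subset of masked patches
    print(dec(enc_out, target_idx).shape)        # torch.Size([2, 36, 768])
```

Because the decoder reconstructs only a subset of masked patches and never lets mask tokens attend to one another, the decoding cost scales with the number of target patches rather than the full token grid, which is the property the paper exploits.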
Submission Number: 83
