Rethinking Patch Dependence for Masked Autoencoders

Letian Fu; Long Lian; Renhao Wang; Baifeng Shi; XuDong Wang; Adam Yala; Trevor Darrell; Alexei A Efros; Ken Goldberg

Published: 09 Apr 2025, Last Modified: 09 Apr 2025

Accepted by TMLR

License: CC BY 4.0

Abstract: In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io/
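The cross-attention-only readout described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked code for that); it assumes a single attention head, no layer norm or MLP, random weights, and hypothetical names and sizes (`cross_attention_decode`, `d`, `p`, the patch counts). It only shows the structural point: each mask-token query attends to the encoder outputs of visible patches and never to other mask tokens, so decoding a subset of masked patches gives the same predictions for those patches as decoding all of them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_decode(queries, visible, w_q, w_k, w_v, w_out):
    """Single-head cross-attention readout (illustrative only).

    queries: (M, d) mask-token queries, one per masked patch
    visible: (N, d) encoder outputs for the visible patches
    returns: (M, p) one predicted patch vector per query
    """
    q = queries @ w_q
    k = visible @ w_k
    v = visible @ w_v
    # Each query row attends over visible tokens only: shape (M, N).
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return (attn @ v) @ w_out

# Hypothetical sizes: embedding dim d, flattened patch size p.
rng = np.random.default_rng(0)
d, p = 16, 48
n_visible, n_masked = 5, 12
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
w_out = rng.normal(size=(d, p))
visible = rng.normal(size=(n_visible, d))
queries = rng.normal(size=(n_masked, d))

# Decode every masked patch, then only a small subset of them.
full = cross_attention_decode(queries, visible, w_q, w_k, w_v, w_out)
subset_idx = np.array([1, 4, 7])
partial = cross_attention_decode(queries[subset_idx], visible,
                                 w_q, w_k, w_v, w_out)

# No mask-token interaction: the subset predictions match the full run.
print(np.allclose(full[subset_idx], partial))
```

Because the queries do not interact, the cost of the decoder in this sketch scales with the number of masked patches that are actually decoded, which is consistent with the abstract's point about reconstructing only a small subset.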

Submission Length: Regular submission (no more than 12 pages of main content)

Code: https://crossmae.github.io/

Assigned Action Editor: ~Hongsheng_Li3

Submission Number: 3517
