Abstract: Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning. It operates by randomly masking image patches and reconstructing these masked patches using the unmasked ones. A key limitation of MAE lies in its disregard for the varying informativeness of different patches, as it uniformly selects patches to mask. To overcome this, some approaches propose masking based on patch informativeness. However, these methods often do not consider the specific requirements of downstream tasks, potentially leading to suboptimal representations for these tasks. In response, we introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that leverages end-to-end feedback from downstream tasks to learn an optimal masking strategy during pretraining. Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning. Compared to existing methods, it demonstrates remarkable improvements across diverse datasets and tasks, showcasing its adaptability and efficiency. Our code is available at https://github.com/Alexiland/MLO-MAE.
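To make the multi-level idea in the abstract concrete, below is a minimal, self-contained PyTorch sketch of the training loop structure it describes, not the authors' implementation: (1) MAE-style reconstruction under a learnable patch-masking policy, (2) fitting a classification head on frozen encoder features, and (3) updating the masking policy from the downstream classification loss. All names (`MaskPolicy`, the tiny MLP encoder/decoder, the soft-mask relaxation) are illustrative assumptions; the actual MLO-MAE uses a ViT-based MAE and the masking strategy described in the paper.

```python
# Hypothetical sketch of a three-level optimization loop for learned masking.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PATCHES, DIM, NUM_CLASSES = 16, 32, 10


class MaskPolicy(nn.Module):
    """Scores patches; a higher score makes a patch more likely to be masked."""
    def __init__(self, num_patches):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_patches))

    def forward(self, batch_size):
        # Soft (differentiable) keep-probabilities for illustration;
        # a hard top-k mask would be used in actual MAE pretraining.
        keep_prob = torch.sigmoid(-self.logits)      # low score -> keep patch
        return keep_prob.expand(batch_size, -1)      # (B, num_patches)


encoder = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
decoder = nn.Linear(DIM, DIM)
head = nn.Linear(DIM, NUM_CLASSES)
policy = MaskPolicy(NUM_PATCHES)

opt_pre = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_head = torch.optim.Adam(head.parameters(), lr=1e-3)
opt_mask = torch.optim.Adam(policy.parameters(), lr=1e-3)

patches = torch.randn(8, NUM_PATCHES, DIM)            # dummy image patches
labels = torch.randint(0, NUM_CLASSES, (8,))          # dummy downstream labels

for step in range(3):
    # Level 1: MAE-style reconstruction under the current masking policy.
    keep = policy(patches.size(0)).unsqueeze(-1)      # (B, P, 1) visibility weights
    recon = decoder(encoder(patches * keep))
    loss_rec = F.mse_loss(recon * (1 - keep), patches * (1 - keep))
    opt_pre.zero_grad()
    loss_rec.backward()
    opt_pre.step()

    # Level 2: fit the classification head on frozen encoder features.
    with torch.no_grad():
        feats = encoder(patches).mean(dim=1)
    loss_cls = F.cross_entropy(head(feats), labels)
    opt_head.zero_grad()
    loss_cls.backward()
    opt_head.step()

    # Level 3: update the masking policy from downstream-task feedback.
    keep = policy(patches.size(0)).unsqueeze(-1)
    feats = encoder(patches * keep).mean(dim=1)
    loss_down = F.cross_entropy(head(feats), labels)
    opt_mask.zero_grad()
    loss_down.backward()
    opt_mask.step()
```

This sketch only illustrates how downstream feedback can reach the masking policy through the chain of the three optimization levels; see the repository linked above for the actual method.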
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: 1. We have reorganized and separated the method and experiment sections (Sections 3 and 4 in the revised manuscript). Additionally, we have revised Section 3.1 with a more intuitive explanation of the overall MLO-MAE framework to enhance the clarity of our method.
2. Additional experiments on MLO-MAE trained solely with semantic segmentation feedback and on continued pretraining without ImageNet-pretrained initialization have been added to Table 6 and Table 9; the corresponding descriptions have also been updated in Sections 4.4 and 4.5.
3. We have added Section 7 to directly discuss the limitations and Appendix A.5 to discuss the convergence properties of our MLO-MAE framework.
4. We have added Appendix E to further discuss the utilization of unlabeled data and fairness in comparing baseline methods under different experimental settings.
Code: https://github.com/Alexiland/MLO-MAE
Supplementary Material: zip
Assigned Action Editor: ~Di_He1
Submission Number: 3774