Local Masked Reconstruction for Efficient Self-Supervised Learning on High-resolution Images

TMLR Paper1745 Authors

25 Oct 2023 (modified: 28 Mar 2024) · Rejected by TMLR
Abstract: Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks, such as image classification, semantic segmentation, and object detection. Among these methods, generative self-supervised approaches such as MAE and BEiT show promising performance. However, their global reconstruction mechanism is computationally demanding, especially for high-resolution images, and this cost grows substantially when pretraining is scaled to large datasets. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that reconstructs image patches from small neighboring regions. The strategy can be easily integrated into any generative self-supervised learning technique and improves the efficiency-accuracy trade-off compared with reconstruction over the entire image. LoMaR is 2.5$\times$ faster than MAE and 5.0$\times$ faster than BEiT on 384$\times$384 ImageNet pretraining, and surpasses them by 0.2\% and 0.8\% in accuracy, respectively. It is 2.1$\times$ faster than MAE on iNaturalist pretraining while gaining 0.2\% in accuracy. On MS COCO, LoMaR outperforms MAE by 0.5 $\text{AP}^\text{box}$ on object detection and 0.5 $\text{AP}^\text{mask}$ on instance segmentation, and it outperforms MAE by 0.2\% on semantic segmentation. Our code will be made publicly available.
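To make the core idea concrete, the following is a minimal sketch of local window sampling for masked reconstruction, as the abstract describes it: rather than masking patches across the entire image grid, each training view draws a few small neighborhoods of patches and masks a fraction of the patches inside each one. All function and parameter names here (`sample_local_windows`, `num_windows`, `window`, `mask_ratio`) are hypothetical illustrations, not the paper's actual API; the specific window size and mask ratio are likewise assumptions for demonstration.

```python
import numpy as np

def sample_local_windows(grid_h, grid_w, num_windows, window, mask_ratio, rng):
    """Sample small patch windows and mask patches inside each one.

    Hypothetical sketch of the LoMaR-style sampling step: instead of
    reconstructing over the full (grid_h x grid_w) patch grid, each
    training example operates on a few small (window x window)
    neighborhoods. Returns a list of (patch_indices, masked_flags) pairs.
    """
    views = []
    for _ in range(num_windows):
        # Pick a top-left corner so the window fits inside the grid.
        top = rng.integers(0, grid_h - window + 1)
        left = rng.integers(0, grid_w - window + 1)
        rows, cols = np.meshgrid(
            np.arange(top, top + window),
            np.arange(left, left + window),
            indexing="ij",
        )
        idx = (rows * grid_w + cols).ravel()  # flat patch indices in the grid

        # Mask a fixed fraction of the patches inside this window only;
        # reconstruction targets would be computed from these local patches.
        n_mask = int(round(mask_ratio * idx.size))
        masked = np.zeros(idx.size, dtype=bool)
        masked[rng.choice(idx.size, size=n_mask, replace=False)] = True
        views.append((idx, masked))
    return views

# Example: a 14x14 patch grid (224x224 image, 16x16 patches), four 7x7
# local windows, masking 80% of the patches within each window.
rng = np.random.default_rng(0)
views = sample_local_windows(grid_h=14, grid_w=14, num_windows=4,
                             window=7, mask_ratio=0.8, rng=rng)
```

The efficiency gain comes from the attention and decoding cost scaling with the number of visible-plus-masked tokens per view (here 49 per window) rather than with the full patch grid (196 tokens), which matters more as resolution grows.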
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Mathieu_Salzmann1
Submission Number: 1745