Abstract: Like masked language modeling (MLM) in NLP, masked image modeling (MIM) learns representations by reconstructing masked image patches, improving feature extraction in deep neural networks (DNNs). Unlike supervised learning, MIM pretraining requires substantial computational resources to handle large batch sizes (e.g., 4096), limiting its scalability. To address this, we propose Block-Wise Masked Image Modeling (BIM), which decomposes the MIM task into sub-tasks with independent computations, enabling block-wise backpropagation instead of the traditional end-to-end approach. BIM achieves performance comparable to MIM while significantly reducing peak memory usage. For evaluation, we provide an anonymized repository ~\href{https://anonymous.4open.science/r/BIM_ICML2025/}{here}. Additionally, BIM facilitates concurrent training of multiple DNN backbones with varying depths, optimizing them for different hardware platforms while reducing computational costs compared to training each backbone separately.
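To make the block-wise backpropagation idea concrete, the following is a minimal sketch (not the paper's implementation; all names such as `BlockWithLocalHead` and `bim_training_step` are hypothetical) of how a backbone split into blocks can be trained with per-block reconstruction losses, detaching activations between blocks so gradients stay local and peak activation memory is bounded by a single block:

```python
# Hedged sketch of block-wise backpropagation with local masked-reconstruction losses.
# Assumption: each backbone block gets a lightweight decoder head and its own optimizer;
# this illustrates the general technique, not the authors' exact BIM code.
import torch
import torch.nn as nn

class BlockWithLocalHead(nn.Module):
    def __init__(self, dim: int, patch_dim: int):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.Linear(dim, patch_dim)  # local reconstruction head

    def forward(self, x):
        return self.block(x)

def bim_training_step(blocks, optimizers, tokens, target_patches, mask):
    """One step: each block updates from its own local loss; no end-to-end gradient."""
    x = tokens
    for block, opt in zip(blocks, optimizers):
        x = block(x)                                   # forward through this block only
        pred = block.decoder(x[mask])                  # reconstruct masked patches locally
        loss = nn.functional.mse_loss(pred, target_patches[mask])
        opt.zero_grad()
        loss.backward()                                # gradients confined to this block
        opt.step()
        x = x.detach()                                 # cut the graph before the next block
    return x
```

Because intermediate outputs at every block already carry a reconstruction objective, earlier prefixes of the block sequence can also serve as shallower backbones, which is consistent with the abstract's claim of training multiple backbones of varying depths concurrently.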
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: Dit-Yan_Yeung2
Submission Number: 5231