CrossMAE: Decoupled Masked Image Modeling with Cross Prediction

TMLR Paper 437 Authors

15 Sept 2022 (modified: 28 Feb 2023) · Rejected by TMLR
Abstract: We present CrossMAE, a novel and flexible masked image modeling (MIM) approach that progressively decouples the existing mask-then-predict objective into mask-then-draw and draw-then-predict. During the mask-then-draw phase, an auxiliary drawing head models the uncertainty and produces coarse and diverse completions. Subsequently, in draw-then-predict, the backbone receives these completions and strives to predict versatile target signals from them. The two decoupled objectives are end-to-end trainable and executed in a single pass, separating low-level generation from high-level understanding. Through extensive experiments and compelling results on a variety of tasks, including image classification, semantic segmentation, object detection, instance segmentation, and even facial landmark detection, we demonstrate that the proposed pre-training scheme learns generalizable features effectively. Beyond surpassing existing MIM counterparts, CrossMAE exhibits better data efficiency in both pre-training and fine-tuning.
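
Below is a minimal, hypothetical PyTorch sketch of the single-pass, two-phase scheme the abstract describes: an auxiliary drawing head completes the masked input (mask-then-draw), and the backbone then predicts target signals from that completion (draw-then-predict). The module names (DrawingHead, Backbone), dimensions, masking ratio, target signals, and loss weighting are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class DrawingHead(nn.Module):
    """Mask-then-draw: produce a coarse completion of the masked tokens (assumed design)."""
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, masked_tokens):
        return self.net(masked_tokens)  # coarse, low-level completion

class Backbone(nn.Module):
    """Draw-then-predict: encode the completion and predict target signals (assumed design)."""
    def __init__(self, dim=768, target_dim=768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pred = nn.Linear(dim, target_dim)

    def forward(self, drawn_tokens):
        return self.pred(self.encoder(drawn_tokens))

def training_step(tokens, mask, target, drawing_head, backbone, draw_weight=1.0):
    """One end-to-end pass: mask -> draw -> predict, with two decoupled losses."""
    masked = tokens * (~mask).unsqueeze(-1)           # zero out masked patch tokens
    drawn = drawing_head(masked)                      # phase 1: coarse completion
    draw_loss = ((drawn - tokens) ** 2)[mask].mean()  # low-level generation loss on masked tokens
    pred = backbone(drawn)                            # phase 2: predict versatile signals
    pred_loss = ((pred - target) ** 2).mean()         # high-level prediction loss
    return draw_weight * draw_loss + pred_loss

# Toy usage with random data (shapes and masking ratio are assumptions)
tokens = torch.randn(2, 196, 768)                     # batch of patch tokens
mask = torch.rand(2, 196) < 0.75                      # ~75% of tokens masked
target = torch.randn(2, 196, 768)                     # placeholder target signal
loss = training_step(tokens, mask, target, DrawingHead(), Backbone())
loss.backward()

The point of the sketch is only the decoupling: the drawing head carries the generation loss while the backbone carries the prediction loss, yet both are optimized jointly in a single forward/backward pass.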
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Hanwang_Zhang3
Submission Number: 437