CAE v2: Context Autoencoder with CLIP Latent Alignment
Abstract: Masked image modeling (MIM) learns visual representations by predicting a pre-defined target for the masked patches. Inspired by MVP (Wei et al., 2022b), which shows impressive gains with CLIP, we also employ the semantically rich CLIP latent as the target and further tap its potential by introducing a new MIM pipeline, CAE v2, to learn a high-quality encoder and facilitate model convergence on the pre-training task. CAE v2 is an improved variant of CAE (Chen et al., 2023) that applies the CLIP latent to two pre-training tasks: visible latent alignment and masked latent alignment. Visible latent alignment directly aligns the encoder's latent representations of the visible patches with the corresponding CLIP latents, which facilitates model convergence and improves the representational ability of the encoder. Masked latent alignment predicts the representations of the masked patches within the feature space of the CLIP latent, as the standard MIM task does, effectively aligning the representations computed by the encoder and the regressor into the same domain. We pre-train CAE v2 on ImageNet-1K images and evaluate it on various downstream vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. Experiments show that CAE v2 achieves competitive performance and even outperforms the CLIP vision encoder, demonstrating the effectiveness of our method. Code is available at https://github.com/Atten4Vis/CAE.
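The two alignment objectives described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not taken from the released code: the function names, the use of cosine distance as the alignment measure, and the equal weighting of the two terms are all assumptions made for illustration only.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def alignment_loss(predicted, target, indices):
    """Mean cosine distance to the target over the given patch positions."""
    return sum(cosine_distance(predicted[i], target[i]) for i in indices) / len(indices)


def two_task_loss(encoder_latents, regressor_latents, clip_latents, mask):
    """Hypothetical combined objective (assumed form, equal weights).

    mask[i] is True when patch i is masked.
    encoder_latents maps visible patch index -> encoder output.
    regressor_latents maps masked patch index -> regressor prediction.
    clip_latents[i] is the CLIP target for patch i.
    """
    visible = [i for i, m in enumerate(mask) if not m]
    masked = [i for i, m in enumerate(mask) if m]
    # Visible latent alignment: encoder outputs vs. CLIP latents.
    loss_visible = alignment_loss(encoder_latents, clip_latents, visible)
    # Masked latent alignment: predictions for masked patches vs. CLIP latents.
    loss_masked = alignment_loss(regressor_latents, clip_latents, masked)
    return loss_visible, loss_masked, loss_visible + loss_masked


# Toy example: four patches with 2-dimensional latents; patches 1 and 3 are masked.
clip_latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
mask = [False, True, False, True]
encoder_latents = {0: [2.0, 0.0], 2: [1.0, 1.0]}    # same direction as the targets
regressor_latents = {1: [0.0, 3.0], 3: [0.0, 1.0]}  # patch 3 is orthogonal to its target
loss_visible, loss_masked, loss_total = two_task_loss(
    encoder_latents, regressor_latents, clip_latents, mask
)
```

In this toy case the visible term is approximately zero because the encoder outputs point in the same direction as their CLIP targets, while the masked term is about 0.5 because one of the two predictions is orthogonal to its target.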
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We sincerely appreciate the time and effort that the editor and all reviewers devoted to reviewing our paper, and we are deeply grateful for the valuable and constructive suggestions for improving it. Following the editor's suggestion, we have revised the manuscript by incorporating into the camera-ready version the reviewer's concerns raised during the post-rebuttal discussion, including clarifications about the positioning of the paper with respect to CLIP [4Cp8]. We thank the editor and all the reviewers again!
Assigned Action Editor: ~Fuxin_Li1
Submission Number: 1130