CAE v2: Context Autoencoder with CLIP Latent Alignment

Xinyu Zhang; Jiahui Chen; Junkun Yuan; Qiang Chen; Jian Wang; Xiaodi Wang; Shumin Han; Xiaokang Chen; Jimin Pi; Kun Yao; Junyu Han; Errui Ding; Jingdong Wang

Published: 05 Oct 2023, Last Modified: 17 Sept 2024, Accepted by TMLR, CC BY 4.0

Abstract: Masked image modeling (MIM) learns visual representations by predicting masked patches against a pre-defined target. Inspired by MVP (Wei et al., 2022b), which shows impressive gains with CLIP, we also employ the semantically rich CLIP latent as the target and further tap its potential by introducing a new MIM pipeline, CAE v2, to learn a high-quality encoder and facilitate model convergence on the pre-training task. CAE v2 is an improved variant of CAE (Chen et al., 2023) that applies the CLIP latent to two pre-training tasks: visible latent alignment and masked latent alignment. Visible latent alignment directly aligns the encoder's latent representations of visible patches with the corresponding CLIP latents, which facilitates model convergence and improves the representational ability of the encoder. Masked latent alignment predicts the representations of masked patches within the feature space of the CLIP latent, as the standard MIM task does, effectively aligning the representations computed by the encoder and the regressor into the same domain. We pre-train CAE v2 on ImageNet-1K images and evaluate it on various downstream vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. Experiments show that CAE v2 achieves competitive performance and even outperforms the CLIP vision encoder, demonstrating the effectiveness of our method. Code is available at https://github.com/Atten4Vis/CAE.
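To make the two pre-training tasks concrete, the following is a minimal sketch of how the two alignment terms could be combined, not the authors' implementation (see the linked repository for that). Several things here are assumptions for illustration only: the abstract does not name the distance function, so cosine distance is used as a stand-in; the two terms are summed with equal weight; and the function names (`cosine_distance`, `alignment_loss`, `cae_v2_style_loss`) are hypothetical. For simplicity every latent list has one entry per patch position, and entries at positions a term does not use are ignored.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity (assumed distance; the abstract does not specify one)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-12)


def alignment_loss(predicted, target, indices):
    """Mean distance between predicted and target latents at the given positions."""
    if not indices:
        return 0.0
    total = sum(cosine_distance(predicted[i], target[i]) for i in indices)
    return total / len(indices)


def cae_v2_style_loss(encoder_latents, regressor_latents, clip_latents, mask):
    """Combine the two alignment terms described in the abstract.

    mask[i] is True when patch i is masked, False when it is visible.
    """
    visible = [i for i, m in enumerate(mask) if not m]
    masked = [i for i, m in enumerate(mask) if m]
    # Visible latent alignment: encoder outputs vs. CLIP latents on visible patches.
    loss_visible = alignment_loss(encoder_latents, clip_latents, visible)
    # Masked latent alignment: regressor predictions vs. CLIP latents on masked patches.
    loss_masked = alignment_loss(regressor_latents, clip_latents, masked)
    # Equal weighting is an assumption of this sketch.
    return loss_visible + loss_masked, loss_visible, loss_masked


if __name__ == "__main__":
    random.seed(0)
    num_patches, dim = 6, 4
    clip = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
    noisy = [[x + random.gauss(0, 0.5) for x in vec] for vec in clip]
    mask = [True, False, True, False, False, True]
    total, vis, msk = cae_v2_style_loss(noisy, noisy, clip, mask)
    print(f"visible={vis:.4f} masked={msk:.4f} total={total:.4f}")
```

In this toy setup, latents that match the CLIP targets exactly give a loss of approximately zero, and each term depends only on its own set of positions, which mirrors the split between visible and masked patches described above.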

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: We sincerely appreciate the time and effort the editor and all reviewers spent reviewing our paper, and we thank them for their valuable and constructive suggestions for improving it. Following the editor's suggestion, we have revised the manuscript to incorporate, in the camera-ready version, the reviewers' concerns raised during the post-rebuttal discussion, including clarifications about the positioning of the paper with respect to CLIP [4Cp8]. We thank the editor and all the reviewers again!

Code: https://github.com/Atten4Vis/CAE

Assigned Action Editor: ~Fuxin_Li1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 1130
