Vision ELECTRA: Adversarial Masked Image Modeling with Hierarchical Discriminator

18 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Masked Image Modeling, Vision ELECTRA, Adversarial Pre-training
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: As a practical pre-training strategy for natural language processing (NLP), ELECTRA first masks parts of the input text and then trains a generator to reconstruct the text and a discriminator to identify which parts are original and which are replaced. In this work, we propose \underline{V}ision \underline{E}LECTRA, namely $\mathcal{VE}$, which migrates ELECTRA to the vision domain with a non-trivial extension. Like ELECTRA, $\mathcal{VE}$ first leverages MAE or SimMIM as the generator to reconstruct images from masked image patches. In particular, random Gaussian noise is injected into the latent space of the generator, in an adversarial-autoencoding manner, to enhance the diversity of the generated patches. Then, given the original images and the reconstructed ones, $\mathcal{VE}$ trains an image encoder (usually ViT or Swin) via a hierarchical discrimination loss, where the discriminator is expected to (1) differentiate between original images and reconstructed ones and (2) differentiate between original patches and generated ones. This gives $\mathcal{VE}$ a unique advantage: it learns contextual representations that characterize images at both the macro and micro levels (i.e., the entire image and individual patches). Extensive experiments have been carried out to evaluate $\mathcal{VE}$ against baselines under fair comparisons. The results show that $\mathcal{VE}$ with a ViT-B backbone attains a top-1 accuracy of 83.43\% on ImageNet-1K image classification, a 1.17\% improvement over baselines under continual pre-training. When transferring $\mathcal{VE}$ pre-trained models to other CV tasks, including segmentation and detection, our method surpasses competing methods, demonstrating its applicability to various tasks.
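To make the two-level objective concrete, below is a minimal PyTorch sketch of a hierarchical discrimination loss of the kind the abstract describes. Everything here is an illustrative assumption rather than the paper's implementation: the function names (`hierarchical_discrimination_loss`, `inject_latent_noise`), the binary cross-entropy formulation at both levels, and the weighting hyperparameters `alpha` and `sigma` are ours; the authors' exact heads and loss weighting may differ.

```python
import torch
import torch.nn.functional as F


def inject_latent_noise(z: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Additive Gaussian noise in the generator's latent space, intended to
    # diversify the generated patches (sigma is an assumed hyperparameter).
    return z + sigma * torch.randn_like(z)


def hierarchical_discrimination_loss(
    patch_logits: torch.Tensor,   # (B, N): one logit per patch
    patch_labels: torch.Tensor,   # (B, N): 1 = generated patch, 0 = original
    image_logits: torch.Tensor,   # (B,):   one logit per image
    image_labels: torch.Tensor,   # (B,):   1 = reconstructed image, 0 = original
    alpha: float = 0.5,           # assumed macro/micro weighting
) -> torch.Tensor:
    # Micro level: detect replaced patches, analogous to ELECTRA's
    # token-level replaced-token-detection objective.
    micro = F.binary_cross_entropy_with_logits(patch_logits, patch_labels.float())
    # Macro level: discriminate original images from reconstructed ones.
    macro = F.binary_cross_entropy_with_logits(image_logits, image_labels.float())
    return alpha * macro + (1.0 - alpha) * micro
```

In this sketch, the macro term pushes the encoder toward whole-image cues while the micro term forces it to attend to individual patches, which is one plausible reading of how the hierarchical loss yields both macro- and micro-level representations.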
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1187