Enhancing Vision-Language Model with Unmasked Token Alignment

Published: 26 May 2024, Last Modified: 17 Sept 2024Accepted by TMLREveryoneRevisionsBibTeXCC BY 4.0
Abstract: Contrastive pre-training on image-text pairs, exemplified by CLIP, becomes a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces $\textbf{U}$nmasked $\textbf{T}$oken $\textbf{A}$lignment ($\textbf{UTA}$), a method that leverages existing CLIP models to further enhance its vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient by avoiding using the extra $\mathrm{[MASK]}$ tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Add additional experiment results to further demonstrate the effectiveness of the proposed method. Add additional explanations of the proposed method. Fix some minor issues, including typos, references, and table placement.
Assigned Action Editor: ~Kui_Jia1
Submission Number: 1874
Loading