AlignCLIP: Enhancing Stable Representations in Vision-Language Pretraining Models through Attention and Prediction Alignment

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Language Vision Pretraining Models, Foundation Models, Domain Adaptation, Out-of-Distribution Generalization
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper introduces AlignCLIP, a novel method designed to enhance the stability of representations in Vision-Language Pretraining models like CLIP by addressing attention and predictive category misalignments.
Abstract: Stable representations are pivotal in Vision-Language Pretraining (VLP) models, serving as the foundation for managing domain shifts and recognizing unseen classes in open-world environments. In this paper, we identify and delve into two primary misalignment problems in VLP models such as contrastive language-image pre-training (CLIP): attention misalignment, where the model disproportionately allocates attention to background visual tokens, and predictive category misalignment, where the model struggles to discern class similarities accurately. Addressing these misalignments is paramount, as they undermine the stability of representations and, consequently, the adaptability and trustworthiness of the model in open-world environments. To counteract these misalignments, we introduce AlignCLIP, a new fine-tuning method. AlignCLIP introduces a novel training objective, the attention alignment loss, which aligns the attention distributions of the multi-head attention layers with the correlations between visual tokens and class prompts. Further, AlignCLIP introduces semantic label smoothing, which preserves the prediction hierarchy implied by class similarities derived from textual information. Our empirical studies across varied datasets and out-of-distribution settings demonstrate that AlignCLIP yields more stable representations and stronger generalization, confirming its adaptability and stability in scenarios involving domain shifts and unseen classes.
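The abstract does not give the formulas for the two objectives, so the following is only a minimal, hypothetical PyTorch sketch of how an attention alignment loss and semantic label smoothing of this kind could look; all function names, shapes, and the temperature values are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch (not the paper's code): illustrates the two ideas
# described in the abstract with standard PyTorch operations.
import torch
import torch.nn.functional as F

def semantic_soft_targets(text_embeds, labels, alpha=0.1, temperature=0.07):
    """Soft labels whose off-target mass follows class-prompt similarity.

    text_embeds: (C, D) L2-normalized class-prompt embeddings from the text encoder.
    labels:      (B,)   ground-truth class indices.
    alpha:       total probability mass spread over non-target classes.
    """
    sim = text_embeds @ text_embeds.t()                 # (C, C) class-to-class similarity
    sim.fill_diagonal_(float("-inf"))                   # exclude the target class itself
    soft = F.softmax(sim / temperature, dim=-1)         # similarity-weighted distribution
    targets = (1 - alpha) * F.one_hot(labels, sim.size(0)).float()
    targets += alpha * soft[labels]                     # blend in semantically close classes
    return targets                                      # (B, C), rows sum to 1

def attention_alignment_loss(attn_cls_to_patches, patch_embeds, class_text_embed):
    """Penalize attention that disagrees with patch-to-class-prompt relevance.

    attn_cls_to_patches: (B, N) attention from the [CLS] token to N visual tokens.
    patch_embeds:        (B, N, D) L2-normalized visual token embeddings.
    class_text_embed:    (B, D)   L2-normalized prompt embedding of the true class.
    """
    relevance = torch.einsum("bnd,bd->bn", patch_embeds, class_text_embed)
    relevance = F.softmax(relevance / 0.07, dim=-1)     # target attention distribution
    attn = attn_cls_to_patches.clamp_min(1e-8)
    return F.kl_div(attn.log(), relevance, reduction="batchmean")
```

Under these assumptions, the soft targets would replace one-hot labels in the classification loss (e.g., `-(targets * log_probs).sum(-1).mean()`), and the alignment term would be added to the fine-tuning objective with a weighting coefficient.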
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1850