Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models

Published: 02 May 2024, Last Modified: 25 Jun 2024. ICML 2024 Poster. License: CC BY 4.0
Abstract: It has recently been discovered that using a pre-trained *vision-language model* (VLM), e.g., CLIP, to align a whole query image with several finer text descriptions generated by a large language model can significantly enhance zero-shot performance. However, in this paper, we empirically find that the finer descriptions tend to align more effectively with *local areas of the query image* rather than the whole image, and then we theoretically validate this finding. Thus, we present a method called *weighted visual-text cross alignment* (WCA). This method begins with a *localized visual prompting* technique, designed to identify local visual areas within the query image. The local visual areas are then *cross-aligned* with the finer descriptions by creating a similarity matrix using the pre-trained VLM. To determine how well a query image aligns with each category, we develop a score function based on the weighted similarities in this matrix. Extensive experiments demonstrate that our method significantly improves zero-shot performance across various datasets, achieving results that are even comparable to few-shot learning methods.
Submission Number: 3077
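The sketch below is an illustrative, non-authoritative rendering of the cross-alignment idea described in the abstract, written against the openai/CLIP package (`pip install git+https://github.com/openai/CLIP.git`). The random-crop sampler, the softmax crop weighting, and the `wca_score` helper are assumptions made for this example; they stand in for the paper's localized visual prompting and weighted scoring steps rather than reproducing them.

```python
# Minimal sketch of weighted visual-text cross alignment with CLIP.
# NOTE: the crop sampler and weighting below are illustrative assumptions,
# not the authors' exact localized visual prompting / weighting scheme.
import torch
import clip
from torchvision import transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def wca_score(image: Image.Image, descriptions: list[str], n_crops: int = 16) -> float:
    """Score how well a query image matches one category's finer descriptions."""
    # Localized visual prompting (stand-in): sample random local crops of the image.
    cropper = transforms.RandomResizedCrop(224, scale=(0.3, 0.8))
    crops = torch.stack([preprocess(cropper(image)) for _ in range(n_crops)]).to(device)

    text_tokens = clip.tokenize(descriptions).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(crops)       # (n_crops, d)
        txt_feat = model.encode_text(text_tokens)  # (n_desc, d)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

    # Cross-alignment: similarity matrix between local crops and finer descriptions.
    sim = img_feat @ txt_feat.T                    # (n_crops, n_desc)

    # Weighted aggregation (assumed scheme): weight each crop by how well it
    # matches the description set on average, then average the weighted similarities.
    crop_w = torch.softmax(sim.mean(dim=1), dim=0)  # (n_crops,)
    return float((crop_w @ sim).mean())
```

For zero-shot classification under this sketch, one would call `wca_score` once per candidate category (each with its own list of LLM-generated descriptions) and predict the category with the highest score.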