Abstract: Since its release by OpenAI, the CLIP model has received widespread attention. However, categories in the real world often follow a long-tail distribution, and existing CLIP models struggle to recognize rare, tail-end classes, such as an endangered African bird species.
An intuitive idea is to generate visual descriptions for these tail-end classes and use the descriptions to create category prototypes for classification.
However, experiments reveal that visual descriptions, image captions, and test prompt templates belong to three distinct domains, leading to distribution shifts.
In this paper, we propose caption object parsing to identify the set of objects contained within each caption.
During training, these object sets are used to generate visual descriptions and test prompts, aligning the three domains and enabling the text encoder to generate category prototypes from visual descriptions.
Thanks to the acquired object sets, our approach can construct many-to-many relationships at a lower cost and derive soft labels, addressing the noise issues associated with traditional one-to-one matching. Extensive experimental results demonstrate that our method significantly surpasses the CLIP baseline and exceeds existing methods, achieving a new state-of-the-art (SOTA).
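The core mechanism can be illustrated with a minimal sketch (not the authors' released code): each caption is parsed into an object set, and soft labels are then derived from the many-to-many overlap between that object set and the class vocabulary. The simple keyword-based parser and uniform label weighting below are illustrative assumptions; the paper's actual caption-object-parsing pipeline and weighting scheme may differ.

```python
from typing import List, Set
import re

def parse_objects(caption: str, vocabulary: Set[str]) -> Set[str]:
    """Hypothetical caption object parser: keep vocabulary words found in the caption."""
    tokens = set(re.findall(r"[a-z]+", caption.lower()))
    return tokens & vocabulary

def soft_labels(object_set: Set[str], class_names: List[str]) -> List[float]:
    """Spread the label mass over every class present in the parsed object set,
    instead of forcing a single one-to-one hard label."""
    hits = [1.0 if c in object_set else 0.0 for c in class_names]
    total = sum(hits) or 1.0
    return [h / total for h in hits]

classes = ["dog", "cat", "bird", "bicycle"]
caption = "A dog chasing a cat past a bicycle."
objs = parse_objects(caption, set(classes))
print(objs)                        # {'dog', 'cat', 'bicycle'}
print(soft_labels(objs, classes))  # [0.33, 0.33, 0.0, 0.33]
```

In this sketch, a caption mentioning several objects contributes supervision to all of the corresponding classes, which is the many-to-many relationship the abstract contrasts with noisy one-to-one matching.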
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language, [Generation] Multimedia Foundation Models, [Experience] Multimedia Applications
Relevance To Conference: This paper addresses problems in the CLIP training process to achieve better alignment between language and vision, placing it squarely in the classic field of multi-modality.
Supplementary Material: zip
Submission Number: 5203