Recent advancements in vision-language models (VLMs), which are designed to jointly understand vision and language, have demonstrated strong zero-shot classification capabilities. However, despite this impressive performance, it is widely acknowledged that fine-tuning remains essential to adapt these models to new target tasks. Adaptation requires collecting a target dataset, which may contain incorrect labels that severely degrade performance after fine-tuning. In this paper, we aim to improve classification fine-tuning under a noisily labeled training dataset by leveraging the zero-shot classification capability of the pre-trained model. We first conduct a detailed study of how pre-trained VLMs behave under various classification text prompts, including human-crafted templates and LLM-crafted visual characteristics. This investigation reveals that a VLM's pre-trained knowledge is skewed toward certain classes and that each prompt type exhibits different expertise across classes. Based on these observations, we introduce a robust training method called PoND, which combines different types of prompts in a complementary manner, exploiting each prompt's per-class expertise. We systematically compare the proposed algorithm with existing denoising techniques designed for VLMs and show that it outperforms prior approaches on 11 real-world datasets.
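To make the core idea concrete, the sketch below illustrates per-class prompt ensembling for zero-shot VLM classification. This is a minimal illustration only, not the PoND algorithm itself: the abstract does not specify how the per-class weights are obtained, so the weight matrix `w` and the helper names here are hypothetical placeholders.

```python
# Minimal sketch: combining zero-shot predictions from several prompt types
# (e.g., human-crafted templates vs. LLM-crafted visual descriptions) with
# per-class weights. The weight matrix `w` is a hypothetical stand-in for
# however the method estimates each prompt's per-class expertise.
import torch
import torch.nn.functional as F

def zero_shot_logits(image_feats, text_feats):
    """Cosine-similarity logits between L2-normalized image and class text embeddings."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feats @ text_feats.T  # (batch, num_classes)

def combine_prompts(image_feats, text_feats_per_prompt, w):
    """Fuse zero-shot predictions across prompt types, class by class.

    text_feats_per_prompt: list of (num_classes, dim) class embeddings,
        one entry per prompt type.
    w: (num_prompts, num_classes) nonnegative weights summing to 1 over
        prompts for each class -- how much to trust each prompt per class.
    """
    probs = torch.stack(
        [F.softmax(zero_shot_logits(image_feats, t), dim=-1)
         for t in text_feats_per_prompt]
    )  # (num_prompts, batch, num_classes)
    # Weight each prompt's prediction per class, then sum over prompt types.
    return (w.unsqueeze(1) * probs).sum(dim=0)  # (batch, num_classes)
```

In a noisy-label setting, such fused predictions could, for instance, serve as a reference signal for identifying mislabeled training examples; the specific way PoND uses them is detailed in the paper body.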