Abstract: Vision-language foundation models pretrained on large-scale data benefit many visual understanding tasks. Notably, many such models adopt two encoders (visual and textual) that map the two modalities into a shared embedding space. As a result, the learned representations achieve strong zero-shot performance on tasks such as image classification. However, when only a few examples per category are available, the potential of large vision-language models is not fully realized, mainly due to the disparity between the vast number of parameters and the relatively limited amount of training data. This paper shows that few-shot classification can be significantly improved by using the category names to initialize the classification head. More interestingly, even imperfect category names, or names from a foreign language, improve few-shot classification performance compared with random initialization. With the proposed category name initialization method, our model obtains state-of-the-art performance on several few-shot image classification benchmarks (e.g., 87.37% on ImageNet and 96.08% on Stanford Cars, both using five-shot learning). Additionally, we conduct an in-depth analysis of category name initialization, explore the point at which the benefits of category names diminish, examine how distillation techniques can enhance the performance of smaller models, and investigate other pivotal factors and intriguing phenomena in few-shot learning. Our findings offer valuable insights and guidance for future research.
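For readers unfamiliar with the core idea, the sketch below illustrates category name initialization as described in the abstract: the text encoder of a dual-encoder vision-language model embeds each category name, and the normalized embeddings serve as the initial weights of the classification head before few-shot fine-tuning. This is a minimal sketch under assumed interfaces; the names `text_encoder` and `build_classifier_head` are hypothetical and not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def build_classifier_head(text_encoder, category_names, embed_dim):
    """Build a linear head whose rows are the (normalized) category name embeddings.

    `text_encoder` is assumed to map a string to a 1-D embedding of size
    `embed_dim` in the same space as the image encoder (hypothetical interface).
    """
    with torch.no_grad():
        # Embed each category name and L2-normalize, so the head starts out
        # as a zero-shot classifier in the shared image-text embedding space.
        name_embeddings = torch.stack([text_encoder(name) for name in category_names])
        name_embeddings = F.normalize(name_embeddings, dim=-1)

    head = torch.nn.Linear(embed_dim, len(category_names), bias=False)
    head.weight.data.copy_(name_embeddings)  # each row acts as a class prototype
    return head

# Few-shot training then proceeds as usual: image features from the visual
# encoder are scored by `head`, which is fine-tuned (optionally together with
# the encoder) on the few labeled examples per class.
```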
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=6KKp5esk5G
Changes Since Last Submission: We have made several updates based on the reviewers' comments.
- We include the discussion of some references mentioned by reviewer U7zf in the last paragraph of Section 2.
- We include the discussion of de-duplication procedures in Section 4.1.
- We update the comparison in Table 2 by adding the few-shot performance of CoCa-2b + random initialization and add a corresponding discussion in the "Category name initialization vs. random initialization" paragraph of Section 4.3.
- We include more examples in Figure 3.
- We add a clarification of the normalization in the second paragraph of Section 3.2.
- We have fixed the typos mentioned by reviewer 6zmr in the revised manuscript.
- In Section 4.5, we clarify that the method is not applicable when category names are unavailable.
- We add a discussion of hyperparameters and computational cost, and merge it, together with the Optimization paragraph from Section 3.2 and the original Section 4.1 (Data), into the new Section 4.1, renamed Experimental Setup.
Assigned Action Editor: ~Brian_Kulis1
Submission Number: 1079