Abstract: Recent advances in vision-language foundation models, such as CLIP, have led to significant strides in zero-shot classification. However, the extensive parameterization of models like CLIP makes fine-tuning resource-intensive. In response, TIP-Adapter and SuS-X have introduced training-free methods aimed at improving performance on downstream tasks. While these approaches incorporate support sets to keep the data distribution consistent between the knowledge cache and the test set, they often generalize poorly when test data exhibits substantial distributional variation. In this work, we present CapS-Adapter, an innovative method that employs a caption-based support set, effectively harnessing both image and caption features to exceed existing state-of-the-art techniques in training-free scenarios. CapS-Adapter constructs support sets that closely mirror the target distribution, using instance-level distribution features extracted from multimodal large models. By leveraging CLIP's single-modal and cross-modal strengths, CapS-Adapter enhances predictive accuracy through the use of multimodal support sets. Our method achieves outstanding zero-shot classification results across 19 benchmark datasets, improving accuracy by 2.19% over the previous leading method. Our contributions are substantiated through extensive validation on multiple benchmark datasets, demonstrating superior performance and robust generalization capabilities.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: We introduce CapS-Adapter, a framework designed to adapt vision-language models to downstream tasks by leveraging a multimodal support set constructed from image captions.
We propose an efficient method (CapS) for constructing multimodal support sets that are closely aligned with the target distribution. CapS addresses both the support set's deviation from the target distribution and the repetition of content within it, and it innovatively incorporates text features and image features into the support set simultaneously.
We present m-Adapter, an inference approach that adapts vision-language models to downstream tasks using multimodal support sets. It exploits the text and image features stored in the multimodal support set, leveraging the single-modal and cross-modal capabilities of vision-language models to deliver robust predictions.
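The exact formulation of m-Adapter is not reproduced in this summary; as a rough illustration only, the sketch below shows a TIP-Adapter-style cache classifier extended with cached caption features, assuming L2-normalized CLIP embeddings. All names here (m_adapter_logits, alpha, beta, the sup_* tensors) are hypothetical and introduced purely for illustration, not taken from the paper.

```python
import torch

def m_adapter_logits(test_img_feat, sup_img_feats, sup_txt_feats,
                     sup_labels_onehot, clf_weights, alpha=1.0, beta=5.5):
    """Hypothetical sketch of a cache-style multimodal adapter (not the paper's exact method).

    test_img_feat:      (d,)   L2-normalized CLIP image feature of the test sample
    sup_img_feats:      (N, d) cached image features of the support set
    sup_txt_feats:      (N, d) cached caption (text) features of the support set
    sup_labels_onehot:  (N, C) one-hot labels of the support samples
    clf_weights:        (C, d) zero-shot classifier weights from class-prompt text features
    """
    # Zero-shot logits from CLIP's text classifier (cross-modal: test image vs. class prompts).
    zs_logits = 100.0 * test_img_feat @ clf_weights.T                        # (C,)

    # Single-modal affinity: test image feature vs. cached support image features.
    img_affinity = torch.exp(-beta * (1 - test_img_feat @ sup_img_feats.T))  # (N,)

    # Cross-modal affinity: test image feature vs. cached support caption features.
    txt_affinity = torch.exp(-beta * (1 - test_img_feat @ sup_txt_feats.T))  # (N,)

    # Aggregate both affinities over the support labels and blend with the zero-shot logits.
    cache_logits = (img_affinity + txt_affinity) @ sup_labels_onehot         # (C,)
    return zs_logits + alpha * cache_logits
```

In this sketch, alpha balances the cache term against CLIP's zero-shot logits and beta sharpens the support-set affinities, mirroring common cache-adapter designs; the actual weighting used by m-Adapter is described in the paper.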
Supplementary Material: zip
Submission Number: 4737