Abstract: Finetuning a pretrained vision model (PVM) is a common technique for learning downstream vision tasks.
However, the conventional finetuning process with randomly sampled data points results in diminished training efficiency.
To address this drawback, we propose a novel approach, Vision-language Collaborative Active Finetuning (VeCAF).
With the emerging availability of labels and natural language annotations of images through web-scale crawling or controlled generation, VeCAF uses this information to perform parametric data selection for PVM finetuning. VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence to meet the performance goal. This process is assisted by the inherent semantic richness of the text embedding space, which we use to augment image features.
Furthermore, the flexibility of text-domain augmentation allows VeCAF to handle out-of-distribution scenarios without external data.
Extensive experiments show that VeCAF outperforms baselines in both accuracy and computational efficiency on in-distribution and out-of-distribution image classification tasks.
On ImageNet, VeCAF uses up to 3.3$\times$ fewer training batches to reach the target performance compared to full finetuning, and achieves an accuracy improvement of 2.8\% over the state-of-the-art active finetuning method with the same number of batches.
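To make the idea of objective-aware data selection with text-augmented features more concrete, here is a minimal illustrative sketch. It is not the VeCAF algorithm: the abstract does not specify the selection procedure, so this sketch swaps in a simple loss-weighted greedy farthest-point heuristic. The function names (`augment`, `select_indices`), the mixing weight `alpha`, and the use of per-sample loss as the "finetuning objective" signal are all assumptions made for illustration.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def augment(image_feat, text_feat, alpha=0.5):
    """Hypothetical augmentation: mix an image feature with its caption embedding.

    `alpha` is an assumed mixing weight, not a parameter taken from the paper.
    """
    img = _normalize(image_feat)
    txt = _normalize(text_feat)
    return _normalize([(1 - alpha) * i + alpha * t for i, t in zip(img, txt)])


def select_indices(image_feats, text_feats, losses, k, alpha=0.5):
    """Greedily pick k samples that are both high-loss and far from prior picks.

    This is a stand-in heuristic (loss-weighted farthest-point selection),
    not the parametric selection described in the abstract.
    """
    feats = [augment(i, t, alpha) for i, t in zip(image_feats, text_feats)]
    n = len(feats)
    k = min(k, n)
    if k == 0:
        return []
    # Seed with the sample the current model handles worst.
    selected = [max(range(n), key=lambda i: losses[i])]
    min_dist = [_dist(f, feats[selected[0]]) for f in feats]
    while len(selected) < k:
        chosen = set(selected)
        # Score = loss * distance to the nearest already-selected sample.
        best = max(
            (i for i in range(n) if i not in chosen),
            key=lambda i: losses[i] * min_dist[i],
        )
        selected.append(best)
        min_dist = [min(d, _dist(f, feats[best])) for d, f in zip(min_dist, feats)]
    return selected


# Toy usage: samples 0 and 1 are near-duplicates, so only one of them is picked.
image_feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]]
text_feats = image_feats  # assumed caption embeddings, identical here for simplicity
losses = [2.0, 1.9, 1.0, 0.1]
print(select_indices(image_feats, text_feats, losses, k=2))  # [0, 2]
```

In this toy run the second-highest-loss sample is skipped because it nearly duplicates the first pick, which illustrates why combining an objective signal with feature-space diversity can be more useful than ranking by loss alone.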
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work contributes to the downstream finetuning of pretrained vision models for multimedia/multimodal processing. The proposed method selects the most informative data points from a large, annotated dataset by utilizing the joint information of image features and text annotations. This approach improves convergence speed in early finetuning epochs, which enables higher performance with less training time and cost. This benefit is crucial for multimedia/multimodal tasks where limited time and resources are available for downstream finetuning.
Supplementary Material: zip
Submission Number: 293