Abstract: In this paper, we address the challenge of adapting vision-language models (VLMs) to few-shot image recognition in a training-free manner. We observe that existing methods cannot effectively characterize the semantic relationship between support and query samples in a training-free setting. We recognize that, in the semantic feature space, the feature of a query image is a linear and sparse combination of support image features, since support-query pairs are drawn from the same class and share the same small set of distinctive visual attributes. Motivated by this observation, we propose a novel method called Training-free Feature ReConstruction with Sparse optimization (TaCo), which formulates few-shot image recognition as a feature reconstruction and sparse optimization problem. Specifically, we exploit the VLM to encode the query and support images into features. We then use sparse optimization to reconstruct the query feature from the corresponding support features, and the feature reconstruction error defines the reconstruction similarity. Coupled with the text-image similarity provided by the VLM, this reconstruction similarity analysis accurately characterizes the relationship between support and query images, resulting in significantly improved performance in few-shot image recognition. Our extensive experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches by substantial margins.
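To make the reconstruction-similarity idea concrete, below is a minimal sketch of how one could implement it, assuming image features have already been extracted with a VLM encoder (e.g., CLIP) and L2-normalized. It uses scikit-learn's Lasso as a stand-in sparse solver and a simple weighted fusion with the text-image score; the solver choice, the hyperparameters `alpha` and `beta`, and the fusion scheme are illustrative assumptions, not necessarily the ones used by TaCo.

```python
# Minimal sketch of reconstruction similarity for training-free few-shot
# recognition. Assumptions (not from the paper): scikit-learn's Lasso as
# the sparse solver; `alpha` and `beta` are illustrative hyperparameters.
import numpy as np
from sklearn.linear_model import Lasso


def reconstruction_similarity(query_feat, support_feats, alpha=0.1):
    """Reconstruct the query feature as a sparse, non-negative linear
    combination of one class's support features; a lower reconstruction
    error means the query is more similar to that class."""
    # Solve  min_w ||q - S^T w||^2 + alpha * ||w||_1  with w >= 0,
    # where S has shape (n_support, d) and q has shape (d,).
    lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False,
                  max_iter=1000)
    lasso.fit(support_feats.T, query_feat)      # design matrix: (d, n_support)
    recon = support_feats.T @ lasso.coef_       # reconstructed query feature
    return -np.linalg.norm(query_feat - recon)  # larger = more similar


def classify(query_feat, support_feats_by_class, text_sims, beta=0.5):
    """Fuse per-class reconstruction similarity with the VLM's text-image
    cosine similarity (text_sims: class -> score) and pick the best class."""
    scores = {
        cls: beta * reconstruction_similarity(query_feat, S)
             + (1.0 - beta) * text_sims[cls]
        for cls, S in support_feats_by_class.items()
    }
    return max(scores, key=scores.get)
```

The sparsity penalty encodes the observation above: only a handful of support features (those sharing the query's distinctive attributes) should receive non-zero reconstruction weights.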
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Our work advances multimedia processing by introducing a novel, efficient approach to cross-modal few-shot image recognition that requires no training. By recognizing that the feature of a query image is a linear and sparse combination of support image features from the same class, this research leverages sparse optimization for feature reconstruction, using vision-language models (VLMs) for semantic encoding. This method allows for a more precise characterization of the semantic relationships between images, significantly enhancing few-shot image recognition capabilities. TaCo offers a scalable, adaptable solution that mitigates the limitations imposed by the need for large datasets and intensive computation. Through comprehensive experiments, TaCo demonstrates superior performance over existing state-of-the-art methods, showing its potential to transform how multimedia content is processed and understood, especially in applications requiring quick adaptation to new data with minimal prior training. This opens new avenues for advanced, efficient multimodal processing across various domains.
Supplementary Material: zip
Submission Number: 1611