Training-Free Feature Reconstruction with Sparse Optimization for Vision-Language Models

Yi Zhang, Ke Yu, Angelica I. Aviles-Rivero, Jiyuan Jia, Yushun Tang, Zhihai He

Published: 28 Oct 2024, Last Modified: 06 Nov 2025. License: CC BY-SA 4.0
Abstract: In this paper, we address the challenge of adapting vision-language models (VLMs) to few-shot image recognition in a training-free manner. We observe that existing methods are unable to effectively characterize the semantic relationship between support and query samples in a training-free setting. We recognize that, in the semantic feature space, the feature of a query image is a linear and sparse combination of support image features, since support-query pairs from the same class share the same small set of distinctive visual attributes. Motivated by this observation, we propose a novel method called Training-free Feature ReConstruction with Sparse optimization (TaCo), which formulates the few-shot image recognition task as a feature reconstruction and sparse optimization problem. Specifically, we exploit the VLM to encode the query and support images into features. We utilize sparse optimization to reconstruct the query feature from the corresponding support features. The feature reconstruction error is then used to define a reconstruction similarity. Coupled with the text-image similarity provided by the VLM, our reconstruction similarity analysis accurately characterizes the relationship between support and query images. This results in significantly improved performance in few-shot image recognition. Our extensive experimental results on few-shot recognition demonstrate that our method outperforms existing state-of-the-art approaches by substantial margins.
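The pipeline described in the abstract can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it solves the L1-regularized least-squares reconstruction with a simple ISTA (proximal gradient) loop, turns the per-class reconstruction error into a similarity, and fuses it with a precomputed text-image similarity. The functions `sparse_reconstruct` and `taco_scores`, and the hyperparameters `lam`, `lr`, `steps`, and `alpha`, are all assumptions for illustration; the paper's actual solver, fusion rule, and feature extractor (the VLM encoders) may differ.

```python
import numpy as np

def sparse_reconstruct(query, support, lam=0.1, lr=0.01, steps=500):
    """Reconstruct `query` (d,) as a sparse linear combination of the
    columns of `support` (d, n) by solving
        min_w 0.5 * ||support @ w - query||^2 + lam * ||w||_1
    with ISTA: a gradient step on the quadratic term followed by
    soft-thresholding (the proximal step for the L1 penalty)."""
    _, n = support.shape
    w = np.zeros(n)
    for _ in range(steps):
        grad = support.T @ (support @ w - query)
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

def taco_scores(query, supports_by_class, text_sims, alpha=0.5):
    """Score each class by fusing reconstruction similarity with the
    VLM's text-image similarity. `supports_by_class[c]` is a (d, n_c)
    matrix of support features for class c; `text_sims[c]` is the
    text-image similarity. `alpha` and the exp(-error) mapping are
    illustrative choices, not taken from the paper."""
    scores = {}
    for c, S in supports_by_class.items():
        w = sparse_reconstruct(query, S)
        err = np.linalg.norm(query - S @ w)
        # Lower reconstruction error -> higher reconstruction similarity.
        scores[c] = alpha * np.exp(-err) + (1.0 - alpha) * text_sims[c]
    return scores
```

In this toy setting, a query that lies in the span of one class's support features reconstructs with low error from that class and high error from an orthogonal class, so the fused score favors the correct class even when the text-image similarities are tied.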