Abstract: With the advancement of vision-language models such as CLIP and their strong zero-shot recognition performance, numerous CLIP-based methods have emerged for few-shot classification. However, many of them do not fully exploit the rich feature information in the CLIP visual encoder and overlook the fact that the importance of image regions for classification varies across datasets. To address these limitations, we present an attention pooling-based framework for few-shot fine-tuning. Our framework enables the model to learn task-specific attention weights over image regions, while also incorporating background features and a consistency constraint to enhance training. As a result, our approach outperforms state-of-the-art methods on 11 benchmarks, demonstrating its effectiveness.
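To illustrate the core idea of pooling CLIP visual-encoder region features with learned attention weights, the sketch below shows a minimal attention pooling module in PyTorch. It is an assumption-based illustration, not the paper's implementation: the class name, the learnable-query parameterization, and the feature dimensions are hypothetical choices made for the example.

```python
# Minimal sketch of attention pooling over CLIP patch features (illustrative only;
# module and parameter names are assumptions, not the paper's released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Pool region (patch) features with learned, task-specific attention weights."""
    def __init__(self, dim: int):
        super().__init__()
        # A learnable query scores each region; regions more relevant to the
        # current task receive higher attention weights during fine-tuning.
        self.query = nn.Parameter(torch.randn(dim) * dim ** -0.5)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_regions, dim) region features from the visual encoder
        scores = patch_feats @ self.query                            # (batch, num_regions)
        weights = F.softmax(scores, dim=-1)                          # per-region importance
        pooled = (weights.unsqueeze(-1) * patch_feats).sum(dim=1)    # (batch, dim)
        return pooled

# Example usage: pool a 7x7 grid of patch features into one image embedding.
pool = AttentionPooling(dim=512)
feats = torch.randn(4, 49, 512)   # hypothetical batch of CLIP patch features
image_emb = pool(feats)           # (4, 512)
```

The weighted sum replaces a fixed pooling of the encoder output, so the attention weights can adapt to which regions matter for a given dataset; how the paper combines this with background features and the consistency constraint is described in the method section.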