Abstract: With the advancement of vision-language models such as CLIP and their strong zero-shot recognition performance, numerous CLIP-based methods have emerged for few-shot classification. However, many of them do not fully exploit the rich feature information in the CLIP visual encoder and overlook the fact that the importance of image regions for classification varies across datasets. To address these limitations, we present an attention pooling-based framework for few-shot fine-tuning. Our framework enables the model to learn task-specific attention weights over image regions, while also incorporating background features and a consistency constraint to enhance training. As a result, our approach outperforms state-of-the-art methods on 11 benchmarks, demonstrating its effectiveness.
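To illustrate the core idea of pooling CLIP visual-encoder region features with learned attention weights, the sketch below shows a minimal attention pooling module in PyTorch. It is an assumption-based illustration, not the paper's implementation: the class name, the learnable-query parameterization, and the feature dimensions are hypothetical choices made for the example.

```python
# Minimal sketch of attention pooling over CLIP patch features (illustrative only;
# module and parameter names are assumptions, not the paper's released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Pool region (patch) features with learned, task-specific attention weights."""
    def __init__(self, dim: int):
        super().__init__()
        # A learnable query scores each region; regions more relevant to the
        # current task receive higher attention weights during fine-tuning.
        self.query = nn.Parameter(torch.randn(dim) * dim ** -0.5)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_regions, dim) region features from the visual encoder
        scores = patch_feats @ self.query                            # (batch, num_regions)
        weights = F.softmax(scores, dim=-1)                          # per-region importance
        pooled = (weights.unsqueeze(-1) * patch_feats).sum(dim=1)    # (batch, dim)
        return pooled

# Example usage: pool a 7x7 grid of patch features into one image embedding.
pool = AttentionPooling(dim=512)
feats = torch.randn(4, 49, 512)   # hypothetical batch of CLIP patch features
image_emb = pool(feats)           # (4, 512)
```

The weighted sum replaces a fixed pooling of the encoder output, so the attention weights can adapt to which regions matter for a given dataset; how the paper combines this with background features and the consistency constraint is described in the method section.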