Keywords: Vision-language models, Few-shot learning, Attention mechanism, Generalization
TL;DR: Attn-Adapter adds a dual attention mechanism to CLIP for fast few-shot adaptation without retraining the backbone. It outperforms leading methods on cross-category and cross-dataset benchmarks, scales across CLIP backbones, and keeps inference lightweight.
Abstract: Contrastive vision-language models excel at zero-shot image recognition, but adapting them to few-shot scenarios typically relies on computationally intensive offline fine-tuning (e.g., prompt learning), which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization while maintaining efficient inference and scaling across CLIP backbones.
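To make the dual-attention design concrete, below is a minimal PyTorch sketch of the two components as the abstract describes them: a memory adapter that refines category (text) embeddings by attending over support-set image features, and a local-global adapter that enriches the image embedding by attending over local patch tokens. The module names, tensor layouts, residual/LayerNorm choices, and the use of `nn.MultiheadAttention` are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryAttnAdapter(nn.Module):
    """Refines frozen CLIP category (text) embeddings by attending over
    support-set image features from the current episode (hypothetical layout)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, class_emb: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
        # class_emb:     (num_classes, dim)  CLIP text embeddings
        # support_feats: (num_support, dim)  CLIP image embeddings of labeled support examples
        q = class_emb.unsqueeze(0)       # (1, num_classes, dim)
        kv = support_feats.unsqueeze(0)  # (1, num_support, dim)
        refined, _ = self.attn(q, kv, kv)
        return self.norm(class_emb + refined.squeeze(0))  # residual update


class LocalGlobalAttnAdapter(nn.Module):
    """Enriches the global image embedding by attending over local patch tokens
    from the CLIP visual encoder (hypothetical layout)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feat: (batch, dim)              pooled global token
        # local_feats: (batch, num_patches, dim) local patch tokens
        q = global_feat.unsqueeze(1)  # (batch, 1, dim)
        enriched, _ = self.attn(q, local_feats, local_feats)
        return self.norm(global_feat + enriched.squeeze(1))


if __name__ == "__main__":
    dim, n_cls, n_support, n_patch = 512, 5, 25, 49
    mem = MemoryAttnAdapter(dim)
    loc = LocalGlobalAttnAdapter(dim)

    class_emb = torch.randn(n_cls, dim)       # stand-in for CLIP text embeddings
    support = torch.randn(n_support, dim)     # stand-in for support-set image features
    img_global = torch.randn(2, dim)          # stand-in for query image global features
    img_local = torch.randn(2, n_patch, dim)  # stand-in for query image patch tokens

    refined_classes = mem(class_emb, support)   # (5, 512)
    enriched_images = loc(img_global, img_local)  # (2, 512)

    # CLIP-style classification with the adapted embeddings (cosine similarity).
    logits = F.normalize(enriched_images, dim=-1) @ F.normalize(refined_classes, dim=-1).T
    print(logits.shape)  # torch.Size([2, 5])
```

Both adapters leave the CLIP backbone frozen and only condition on the episode's support set at inference time, which is consistent with the abstract's claim of online adaptation without retraining the base model.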
Submission Number: 21