Keywords: Vision-Language Models, Few-Shot Learning, Domain Generalization, Transfer Learning
TL;DR: We introduce DualAdapter, a novel framework that incorporates positive and negative adapters across both vision and language modalities, ensuring efficient and effective adaptation.
Abstract: Leveraging vast datasets from the Internet, large-scale Vision-Language Models (VLMs) demonstrate great potential in learning open-world visual concepts and exhibit remarkable performance across a wide range of downstream tasks through efficient fine-tuning. In this work, we propose a simple yet effective fine-tuning approach called DualAdapter, which for the first time investigates the inference capabilities of VLMs along both positive and negative directions. Unlike conventional approaches that rely solely on positive adapter-style fine-tuning, DualAdapter uniquely incorporates negative text descriptions and image samples, enabling fine-tuning from a dual perspective.
During the few-shot adaptation process, our DualAdapter explicitly enhances correct alignments while simultaneously minimizing incorrect associations. Our rigorous evaluation across 15 datasets reveals that DualAdapter significantly surpasses existing state-of-the-art methods in terms of both adaptation efficiency and robustness to distribution shifts.
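The dual scoring idea described above can be illustrated with a minimal sketch. This is not the authors' implementation; it only assumes that a positive branch rewards similarity between an image embedding and positive class descriptions (e.g. "a photo of a {class}") while a negative branch penalizes similarity to negative descriptions (e.g. "a photo without a {class}"). The function name `dual_logits` and the weighting factor `alpha` are hypothetical choices for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length, as in CLIP-style similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def dual_logits(image_feat, pos_text_feats, neg_text_feats, alpha=1.0):
    """Hypothetical dual-perspective score: reward alignment with
    positive class texts, penalize alignment with negative texts."""
    img = l2_normalize(image_feat)
    pos = l2_normalize(pos_text_feats) @ img  # (num_classes,) positive similarities
    neg = l2_normalize(neg_text_feats) @ img  # (num_classes,) negative similarities
    return pos - alpha * neg                  # higher = better class match

# Toy example with random embeddings standing in for VLM features.
rng = np.random.default_rng(0)
d, num_classes = 8, 3
image_feat = rng.normal(size=d)
pos_text_feats = rng.normal(size=(num_classes, d))
neg_text_feats = rng.normal(size=(num_classes, d))

logits = dual_logits(image_feat, pos_text_feats, neg_text_feats)
pred = int(np.argmax(logits))
print(logits.shape, pred)
```

In this toy setup the predicted class is simply the argmax of the combined score; the actual DualAdapter additionally learns adapter parameters for both branches during few-shot adaptation.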
Submission Number: 71