Keywords: prompt learning; domain-specific data; few-shot learning; vision-language foundation models; CLIP
Abstract: Large vision-language foundation models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, manual prompt engineering remains a major obstacle to deploying such models in practice, since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, the recent Context Optimization (CoOp) work introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp achieves substantial improvements over manual prompts, its learned context generalizes poorly to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of learnable prompts to unseen classes in practical domains. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of domain-specific knowledge from few-shot data. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning in vision-language foundation models. Specifically, PRE achieves a notable improvement of 5.60% in average accuracy on New classes and 3% in Harmonic mean over CoOp in the 16-shot setting.
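To make the reparameterization idea in the abstract concrete, here is a minimal PyTorch sketch. It is an illustration under assumptions, not the paper's actual implementation: the residual MLP encoder, the class name `PromptEncoder`, and parameters such as `n_ctx` and `ctx_dim` are all hypothetical choices for exposition.

```python
import torch
import torch.nn as nn


class PromptEncoder(nn.Module):
    """Reparameterize learnable prompt embeddings through a small encoder.

    A minimal sketch: instead of optimizing the context tokens directly
    (as CoOp does), the tokens are passed through an encoder whose output
    is added back residually before being fed to CLIP's frozen text
    encoder. The MLP-with-residual design here is an assumption made
    for illustration only.
    """

    def __init__(self, n_ctx: int = 16, ctx_dim: int = 512, hidden: int = 128):
        super().__init__()
        # Learnable context token embeddings, as in CoOp.
        self.ctx = nn.Parameter(torch.empty(n_ctx, ctx_dim))
        nn.init.normal_(self.ctx, std=0.02)
        # Small bottleneck MLP that reparameterizes the raw embeddings.
        self.encoder = nn.Sequential(
            nn.Linear(ctx_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, ctx_dim),
        )

    def forward(self) -> torch.Tensor:
        # Residual reparameterization: gradients reach self.ctx both
        # directly and through the encoder, rather than by tuning the
        # raw embeddings alone.
        return self.ctx + self.encoder(self.ctx)
```

In a training loop of this kind, the reparameterized context tokens would be concatenated with each class-name embedding and passed through CLIP's frozen text encoder, with only the context tokens and the small encoder updated from the few-shot data.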
Primary Subject Area: Role of data in foundation models: pre-training, prompting, fine-tuning
Paper Type: Research paper: up to 8 pages
DMLR For Good Track: Participate in DMLR for Good Track
Participation Mode: In-person
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 102