Keywords: Vision-Language Models, Efficient Adaptation, Prompt Tuning, Few-shot Learning
TL;DR: A prompt tuning approach that incorporates patch-level information
Abstract: Prompt tuning is an efficient way to adapt large foundation models such as CLIP by introducing learnable prompts alongside the input tokens, offering a practical alternative to full model finetuning. However, when prompts are trained on base/target tasks, they often overfit, leading to reduced performance on novel, unseen tasks. To address this limitation, various techniques leverage global image semantics to improve accuracy on unseen tasks while maintaining performance on base tasks. However, they often overlook the rich, fine-grained local information that can be crucial for capturing finer semantics and improving generalisation. In this work, we propose a modular approach to prompt tuning that leverages local semantics by incorporating patch-level information, representing the first integration of such semantics in this context. Specifically, we integrate patch-level information across vision, text, and predictions through three consistency mechanisms: 1) a patch-based consistency loss that aligns patches from the prompted input image with those from the same image processed by a frozen model, while also enforcing inter-view consistency by applying the loss across different views, capturing fine-grained regional dependencies and improving vision representation quality; 2) a text prompt consistency loss, in which view-specific text prompts are tailored and regularised to maintain coherence across views; and 3) a prediction consistency mechanism, in which vision features for each view, enriched with patch-level information, are used to generate predictions from view-tailored text features, and these predictions are then regularised across views, complementing the earlier mechanisms and contributing to a cohesive overall framework. Our approach outperforms existing methods across multiple benchmarks, including base-to-novel generalisation, domain generalisation, and cross-dataset evaluation. These results underscore the potential of integrating fine-grained details for more robust and adaptable prompts, marking a step forward in foundation model tuning.
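A minimal sketch of the three consistency terms described in the abstract, written in PyTorch. This is not the authors' implementation: the tensor shapes, the prompted/frozen encoder interfaces, and the loss weights are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def patch_consistency_loss(prompted_patches, frozen_patches):
    """Align patch embeddings from the prompted encoder with those of a
    frozen encoder for the same view. Assumed shape: (B, num_patches, D)."""
    return (1.0 - F.cosine_similarity(prompted_patches, frozen_patches, dim=-1)).mean()

def text_prompt_consistency_loss(text_feats_view_a, text_feats_view_b):
    """Regularise view-specific text prompt features to stay coherent
    across views. Assumed shape: (num_classes, D)."""
    return F.mse_loss(text_feats_view_a, text_feats_view_b)

def prediction_consistency_loss(logits_view_a, logits_view_b):
    """Symmetric KL divergence between class predictions obtained from
    two views (each computed against its view-tailored text features)."""
    log_p = F.log_softmax(logits_view_a, dim=-1)
    log_q = F.log_softmax(logits_view_b, dim=-1)
    return 0.5 * (F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
                  + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean"))

def total_loss(ce_loss, l_patch, l_text, l_pred, w_patch=1.0, w_text=1.0, w_pred=1.0):
    # Weights are illustrative placeholders; the paper's actual weighting
    # scheme is not specified in the abstract.
    return ce_loss + w_patch * l_patch + w_text * l_text + w_pred * l_pred
```

In this sketch, each loss is computed per view pair and summed with the standard cross-entropy objective; the choice of cosine alignment for patches, MSE for text prompts, and symmetric KL for predictions is one plausible instantiation, not necessarily the one used in the paper.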
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3668