In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-Language Model

22 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: prompt learning, in-context examples, test-time adaptation, vision and language model, vision recognition
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Existing pre-trained vision-language models, e.g., CLIP, have displayed impressive zero-shot generalization capabilities on various downstream tasks. However, their performance degrades significantly when test inputs come from a different distribution. To address this, we explore test-time prompt tuning (TTPT), which adapts the CLIP model to a novel downstream task through a single optimization step on an unsupervised objective involving the test sample. Notably, TTPT requires neither task-specific supervision (e.g., supervised fine-tuning) nor model modifications, which makes it a particularly intriguing setting for studying how a pre-trained vision-language model like CLIP can adapt to downstream tasks. Drawing inspiration from recent advances in in-context learning in natural language processing (NLP), we introduce \textit{visual in-context prompting}: a new test sample is associated with very few, or even just one, labeled example that serves as its in-context prompt. As a result, we can reliably estimate a label for the test sample, facilitating the adaptation process. Our approach employs a token net to represent language descriptions as visual prompts that the vision encoder of a CLIP model can comprehend. Using the in-context examples, we further propose a semi-supervised loss to optimize test-sample-aware visual prompts. This optimization allows a pre-trained, frozen CLIP model to be adapted to a test sample from any task via its learned adaptive prompt. To further enhance the integration of visual and text prompts, we design a cyclic learning strategy. Our method demonstrates superior performance and achieves state-of-the-art results across various downstream datasets.
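For intuition, the following is a minimal sketch of the generic test-time prompt-tuning loop that the abstract builds on: one optimization step on an unsupervised (entropy) objective over a single test image, with the CLIP backbone frozen and only a prompt updated. It is illustrative only and assumes the OpenAI CLIP package (clip.load, encode_image, encode_text); the pixel-space visual prompt, the placeholder label set, the entropy objective, and the file name test.jpg are assumptions for the sketch, not the paper's token net, in-context examples, semi-supervised loss, or cyclic learning strategy.

```python
# Hedged sketch of a generic test-time prompt tuning (TTPT) step with frozen CLIP.
# Not the authors' In-Context Prompt Learning method; it only shows the idea of
# adapting a learnable prompt on one unlabeled test sample in a single step.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
model.eval()
for p in model.parameters():          # keep the vision-language model frozen
    p.requires_grad_(False)

classnames = ["dog", "cat", "car"]    # placeholder label set for illustration
text_tokens = clip.tokenize([f"a photo of a {c}" for c in classnames]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text_tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)

# Learnable pixel-space visual prompt (the paper instead generates visual
# prompts from language descriptions via a token net).
visual_prompt = torch.zeros_like(image, requires_grad=True)
optimizer = torch.optim.AdamW([visual_prompt], lr=5e-3)

def logits_for(img):
    feat = model.encode_image(img)
    feat = feat / feat.norm(dim=-1, keepdim=True)
    return 100.0 * feat @ text_feat.t()

# One optimization step on an unsupervised objective: entropy of the prediction.
# (The paper additionally uses labeled in-context examples in a semi-supervised loss.)
optimizer.zero_grad()
log_probs = logits_for(image + visual_prompt).log_softmax(dim=-1)
entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
entropy.backward()
optimizer.step()

with torch.no_grad():
    pred = logits_for(image + visual_prompt).softmax(dim=-1).argmax(dim=-1)
print("predicted:", classnames[pred.item()])
```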
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5321