Keywords: Active Test-Time Adaptation, Vision-Language Model, Retrieval-Augmented
Abstract: Pre-trained vision-language models (VLMs) have demonstrated remarkable performance across various real-world benchmarks. In particular, CLIP, one of the most widely used VLMs, has achieved strong performance on vision-language tasks without fine-tuning (i.e., in the zero-shot setting). Nevertheless, it is well known that effectively leveraging a pre-trained model requires adapting it to the test distribution. Since the test distribution is typically unknown, test-time adaptation (TTA) has emerged as one solution. However, existing TTA algorithms rely not on expert-provided ground-truth labels but on pseudo-labels derived from the pre-trained model's own knowledge. This reliance can cause incorrect knowledge to propagate and compound. To address this issue, we propose a novel framework, active test-time adaptation, which selectively queries human experts for ground-truth labels of uncertain samples and incorporates those labels when answering future queries. We then develop a novel algorithm, **RE**trieval-augmented **ACT**ive TTA (**REACT**), designed to be plug-and-play with any TTA algorithm. Extensive experiments on ten real-world benchmarks commonly used for CLIP evaluation, as well as an ImageNet-based domain transfer benchmark, show that the proposed algorithm effectively identifies and queries informative samples and leverages them to enhance test-time inference.
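The abstract describes the core loop: query an expert when the zero-shot prediction is uncertain, store the labeled sample, and retrieve stored labels to augment later predictions. The sketch below illustrates that idea only; it is not the paper's REACT algorithm. The entropy threshold, the nearest-neighbor retrieval, the blending weight `alpha`, and the `query_oracle` callback are all illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(probs):
    return -np.sum(probs * np.log(probs + 1e-12))

class ActiveRetrievalTTA:
    """Minimal sketch of active, retrieval-augmented test-time inference:
    keep a memory of expert-labeled features and blend retrieval-based
    evidence with the zero-shot prediction. All hyperparameters are
    placeholders, not values from the paper."""

    def __init__(self, text_features, entropy_threshold=1.0, k=5, alpha=0.5):
        self.text_features = text_features          # (num_classes, d), L2-normalized class embeddings
        self.entropy_threshold = entropy_threshold  # uncertainty gate for querying the expert
        self.k = k                                  # number of neighbors retrieved from memory
        self.alpha = alpha                          # weight on retrieved evidence
        self.memory_feats, self.memory_labels = [], []

    def predict(self, image_feature, query_oracle):
        # Zero-shot CLIP-style prediction: scaled cosine similarity to class text features.
        zs_logits = 100.0 * self.text_features @ image_feature
        zs_probs = softmax(zs_logits)

        # Active step: query the human expert only for uncertain samples,
        # and store the labeled feature for future retrieval.
        if entropy(zs_probs) > self.entropy_threshold:
            label = query_oracle(image_feature)
            self.memory_feats.append(image_feature)
            self.memory_labels.append(label)
            return label

        # Retrieval step: blend nearest labeled neighbors with the zero-shot prediction.
        if self.memory_feats:
            feats = np.stack(self.memory_feats)
            sims = feats @ image_feature
            top = np.argsort(-sims)[: self.k]
            retrieved = np.zeros_like(zs_probs)
            for i in top:
                retrieved[self.memory_labels[i]] += max(sims[i], 0.0)
            if retrieved.sum() > 0:
                retrieved /= retrieved.sum()
                zs_probs = (1 - self.alpha) * zs_probs + self.alpha * retrieved
        return int(np.argmax(zs_probs))
```

In this sketch the expert budget is controlled implicitly by the entropy threshold; the actual REACT method's querying criterion and retrieval mechanism are specified in the paper itself.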
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5479