Prototype-Based Test-Time Adaptation of Vision-Language Models

Zhaohong Huang; Yuxin Zhang; Wenjing Liu; Fei Chao; Rongrong Ji

Published: 30 Apr 2026; Last Modified: 24 Jun 2026; Venue: ICML 2026 regular; Readers: Everyone; License: CC BY 4.0

Abstract: Test-time adaptation (TTA) has emerged as a promising paradigm for vision–language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these designs introduce two key limitations. First, inference latency increases because the cache grows with the number of classes, which makes them inefficient in large-scale settings. Second, performance degrades when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Specifically, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, and the sample's visual features are incorporated into the corresponding class-specific prototype. Notably, the knowledge from past test samples is integrated and used solely through the prototypes, which eliminates the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. As a result, PTA is highly efficient while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP’s accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP’s inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP’s inference speed.
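To make the prototype idea concrete, the following is a minimal sketch of confidence-weighted class prototypes, written in plain Python. It is not the authors' implementation: the abstract does not specify the update rule or how prototype scores are combined with zero-shot scores, so the weighted running sum, the additive fusion, the `trust` factor, and the names `PrototypeAdapter`, `temperature`, and `alpha` are all assumptions made for illustration. The point it shows is that memory stays at one vector per class no matter how many test samples arrive.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class PrototypeAdapter:
    """Hypothetical sketch: one running prototype per class, no sample cache."""

    def __init__(self, text_features, temperature=100.0, alpha=1.0):
        # text_features: one embedding per class (e.g. from a text encoder).
        self.text = [l2_normalize(t) for t in text_features]
        self.temperature = temperature  # assumed logit scale
        self.alpha = alpha              # assumed weight of the prototype term
        dim = len(self.text[0])
        # Confidence-weighted sum of visual features, and total weight, per class.
        self.proto_sum = [[0.0] * dim for _ in self.text]
        self.proto_weight = [0.0 for _ in self.text]

    def predict_and_update(self, image_feature):
        f = l2_normalize(image_feature)
        zs_logits = [self.temperature * dot(f, t) for t in self.text]
        probs = softmax(zs_logits)  # zero-shot class confidence

        # Fuse zero-shot logits with prototype similarity (assumed additive).
        logits = []
        for c, z in enumerate(zs_logits):
            w = self.proto_weight[c]
            if w > 0:
                proto = l2_normalize(self.proto_sum[c])
                trust = w / (w + 1.0)  # assumed: little evidence, little influence
                z += self.alpha * self.temperature * trust * dot(f, proto)
            logits.append(z)
        prediction = max(range(len(logits)), key=lambda c: logits[c])

        # Fold the sample into every class prototype, weighted by its confidence.
        for c, p in enumerate(probs):
            for i, x in enumerate(f):
                self.proto_sum[c][i] += p * x
            self.proto_weight[c] += p
        return prediction
```

In this sketch each call to `predict_and_update` costs one pass over the classes and touches no stored samples, which is the property the abstract contrasts with cache population and retrieval. A confident sample moves its class prototype strongly, while an uncertain one spreads a small weight across several prototypes.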

Lay Summary: Test-time adaptation helps vision-language models such as CLIP become more reliable when they encounter images from unfamiliar domains, without requiring labels or retraining. Existing methods often work like “taking notes.” They store past test samples in a cache and use them to help classify new images. However, as the notes grow, searching becomes slower, and if some notes are wrong or noisy, future predictions can also be misled. To address this, we propose Prototype-Based Test-Time Adaptation (PTA). Instead of storing many scattered test samples, PTA distills them into a compact “class representative” for each category. Each new image acts as a piece of evidence. Confident samples strongly shape the representative, while uncertain samples are treated more cautiously. In this way, PTA compresses test-time experience from the test stream into stable class-level memory. It avoids expensive cache search, reduces the impact of noisy samples, and improves recognition accuracy while keeping inference speed close to that of the original CLIP model.

Primary Area: Deep Learning

Keywords: Test-Time Adaptation, Vision-Language Models

Submission Number: 5857
