Keywords: Test-time adaptation; Vision-language model; Classification; Deep learning; Calibration
Abstract: Despite the remarkable zero-shot performance of vision-language models such as Contrastive Language-Image Pretraining (CLIP) on many downstream tasks, their performance can degrade under distribution shifts. Test-time adaptation (TTA) offers a solution by adapting the model to these shifts during inference, without requiring labeled data. Prior methods such as CLIP-OT leverage optimal transport for pseudo-labeling. However, the quality of these pseudo-labels can be unreliable, leading to suboptimal adaptation and error accumulation. To address this, we propose CLIP-DR, which introduces two additional key components: (1) a cosine similarity loss that aligns image features with textual prototypes, stabilizing the adaptation direction; and (2) an information maximization regularizer that promotes confident and diverse predictions, preventing model collapse. Extensive evaluation on seven benchmarks (covering 15 corruption types and domain shifts, totaling $\sim$6000 trials) demonstrates that CLIP-DR consistently outperforms state-of-the-art methods while adding only $\sim$0.01 seconds of computation time per batch (e.g., 4\% and 12\% higher accuracy than CLIP-OT and WATT-S, respectively, on TinyImageNet-C at 1.98 seconds per batch).
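The abstract names two regularizers: a cosine alignment loss between image features and textual prototypes, and an information maximization term. Below is a minimal, hypothetical PyTorch sketch of how such losses are commonly formed; the function name, the use of pseudo-labels to pick prototypes, and the particular entropy-based formulation are assumptions for illustration and are not taken from the CLIP-DR paper itself.

```python
import torch
import torch.nn.functional as F

def clip_dr_style_losses(image_feats, text_prototypes, pseudo_labels, logit_scale=100.0):
    """Illustrative sketch (not the authors' implementation).

    image_feats:     (B, D) L2-normalized image embeddings
    text_prototypes: (C, D) L2-normalized class text embeddings
    pseudo_labels:   (B,)   class indices assigned during adaptation
    """
    # Cosine similarity loss: pull each image feature toward the text
    # prototype of its (pseudo-labeled) class to stabilize adaptation.
    assigned = text_prototypes[pseudo_labels]                       # (B, D)
    cos_loss = 1.0 - F.cosine_similarity(image_feats, assigned, dim=-1).mean()

    # Class probabilities from scaled image-text similarities.
    probs = (logit_scale * image_feats @ text_prototypes.t()).softmax(dim=-1)  # (B, C)

    # Information maximization (one common form, assumed here):
    # low per-sample entropy -> confident predictions,
    # high entropy of the mean prediction -> diverse class usage.
    per_sample_ent = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    mean_p = probs.mean(dim=0)
    neg_marginal_ent = (mean_p * mean_p.clamp_min(1e-8).log()).sum()
    im_loss = per_sample_ent + neg_marginal_ent

    return cos_loss, im_loss
```

In practice these two terms would be weighted and added to the base adaptation objective; the exact weights and combination used by CLIP-DR are described in the paper, not here.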
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2333