Regression-based Test-Time Adaptation of Vision-Language Models

ICLR 2026 Conference Submission12036 Authors

18 Sept 2025 (modified: 24 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision-Language Models; Test-Time Adaptation; Regression Learning
Abstract: Mainstream Test-Time Adaptation (TTA) techniques aim to select confident views with lower entropy from a set of augmented views in order to perform instance-level adaptation for vision-language models such as CLIP. However, this entropy-based strategy, which relies only on the probability distribution of the current instance, struggles to estimate reliable entropy for outliers. Surprisingly, we observe that selecting confident views using the ground-truth cross-entropy loss on labeled data achieves overwhelming performance, which motivates us to directly establish a regression mapping between augmented views and their corresponding cross-entropy losses. This paper proposes Regression-based Test-time Adaptation (RTA), which exploits such view-loss relationships as a 'free lunch' for CLIP-based image classification. By training a regression model on diversely distributed data independent of the downstream data, we can predict the cross-entropy loss of each augmented view during actual TTA, thereby achieving more accurate view selection without access to true labels. The key advantage of RTA is that the view-loss mapping can be estimated in advance on diverse data, avoiding the sole reliance of current methods on the probability distribution of a single test instance. Extensive experiments on multiple single-label and multi-label datasets show that RTA significantly outperforms existing entropy-based TTA methods on CLIP multi-class and multi-label classification at negligible computational cost. Our code is available at https://anonymous.4open.science/r/RTA-2ADD
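The selection step described in the abstract (predict a loss for each augmented view, then keep the lowest-loss views and aggregate their predictions) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear regressor `predict_view_loss`, the feature/logit arrays, and the `keep_ratio` parameter are all hypothetical stand-ins for CLIP view features, CLIP logits, and the pretrained view-loss regression model.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_view_loss(view_features, w, b):
    """Hypothetical regressor mapping view features to a predicted
    cross-entropy loss (in the paper this model is trained in advance
    on diverse data independent of the downstream task)."""
    return view_features @ w + b

def select_confident_views(view_features, view_logits, w, b, keep_ratio=0.1):
    """Keep the views with the lowest predicted loss, then average
    their logits to form the final prediction for the test image."""
    pred_loss = predict_view_loss(view_features, w, b)
    k = max(1, int(len(view_features) * keep_ratio))
    keep = np.argsort(pred_loss)[:k]       # lowest predicted loss = most reliable
    return view_logits[keep].mean(axis=0)  # aggregated class logits

# Toy example: 64 augmented views, 512-d features, 10 classes.
feats = rng.normal(size=(64, 512))
logits = rng.normal(size=(64, 10))
w, b = rng.normal(size=512), 0.0
final_logits = select_confident_views(feats, logits, w, b)
print(final_logits.shape)
```

The contrast with entropy-based TTA is confined to the ranking criterion: an entropy-based method would sort views by the entropy of their own softmax distribution, whereas here the ranking comes from a regressor fitted offline.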
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 12036