GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Recent advances in test-time adaptation (TTA) for Vision-Language Models (VLMs) have garnered increasing attention, particularly through the use of multiple augmented views of a single image to boost zero-shot generalization. Unfortunately, existing methods fail to strike a satisfactory balance between performance and efficiency, either because of the excessive overhead of tuning text prompts or because of the unstable benefits of handcrafted, training-free visual feature enhancement. In this paper, we present the Global-Spatial Bias Learner (GS-Bias), an efficient and effective TTA paradigm that learns two biases during TTA: a global bias and a spatial bias. Specifically, the global bias captures the global semantic features of a test image by learning consistency across augmented views, while the spatial bias learns the semantic coherence between regions in the image's spatial visual representation. It is worth highlighting that these two biases are added directly to the logits output by the pretrained VLM, which circumvents the full backpropagation through the VLM that hinders the efficiency of existing TTA methods. This endows GS-Bias with extremely high efficiency while achieving state-of-the-art performance on 15 benchmark datasets. For example, it achieves a 2.23% improvement over TPT in cross-dataset generalization and a 2.72% improvement in domain generalization, while requiring only 6.5% of TPT's memory usage on ImageNet.
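A minimal sketch of this logit-bias idea is given below, assuming a frozen CLIP-like VLM whose per-view logits have already been computed (mocked here with random tensors). The tensor names, the entropy objective, and the shape of the spatial bias are illustrative stand-ins, not the paper's exact formulation; the point is that only the two bias vectors receive gradients, so backpropagation never enters the VLM.

    import torch

    # Stand-in for frozen-VLM outputs: logits of V augmented views over C classes.
    num_views, num_classes, num_steps = 8, 1000, 3
    view_logits = torch.randn(num_views, num_classes)

    # The two learnable bias vectors are the only trainable parameters.
    global_bias = torch.zeros(num_classes, requires_grad=True)
    spatial_bias = torch.zeros(num_classes, requires_grad=True)  # illustrative; the paper derives it from region features
    optimizer = torch.optim.AdamW([global_bias, spatial_bias], lr=1e-3)

    def marginal_entropy(logits):
        # Entropy of the prediction averaged over augmented views.
        p = logits.softmax(dim=-1).mean(dim=0)
        return -(p * p.clamp_min(1e-12).log()).sum()

    for _ in range(num_steps):  # a few TTA steps per test image
        # Biases are added directly to the frozen logits; detach() makes
        # explicit that no gradient flows back into the VLM.
        biased = view_logits.detach() + global_bias + spatial_bias
        loss = marginal_entropy(biased)
        optimizer.zero_grad()
        loss.backward()   # gradients reach only the two bias vectors
        optimizer.step()

    pred = (view_logits.mean(0) + global_bias + spatial_bias).argmax().item()

Because the VLM's forward pass is computed once and reused, the per-step cost is dominated by updating two C-dimensional vectors, which is consistent with the large memory savings over prompt-tuning methods such as TPT reported above.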
Lay Summary: Test-time adaptation (TTA) allows vision-language models to handle new, unseen images without any additional training. It typically works by generating different versions of the same image and letting the model learn from them. Existing methods either learn custom text prompts for each image or aggregate various visual features to adapt to new inputs. However, these approaches can be slow and memory-intensive, or can produce unstable results. To address this, our paper proposes GS-Bias, a simple yet reliable method that subtly adjusts the model's predictions without altering its original reasoning process. It introduces two lightweight "hints": one helps the model form a global understanding of the image, while the other guides it to focus more precisely on local regions. This small modification leads to significant improvements, making the model faster, more efficient, and more accurate.
Primary Area: Deep Learning
Keywords: Test Time Adaptation, Vision-Language Models, Deep Learning
Submission Number: 5703