Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models
Abstract: Large pre-trained vision-language models (VLMs) such as CLIP exhibit strong zero-shot performance but struggle under distribution shifts. We propose Semantic Anchor Transport (SAT), a method that generates pseudo-labels for test samples by aligning visual embeddings with reliable text-based semantic anchors, using Optimal Transport for batch-wise label assignment. These pseudo-labels enable efficient test-time adaptation through principled cross-modal alignment. We further incorporate multi-template distillation to leverage diverse textual cues, replicating multi-view contrastive learning without added computational cost. Extensive experiments demonstrate consistent performance gains over state-of-the-art methods across multiple benchmarks while maintaining computational efficiency.
Submission Number: 20
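The abstract describes batch-wise pseudo-label assignment by transporting visual embeddings onto text-based semantic anchors via Optimal Transport. Below is a minimal, hedged sketch of that idea using entropic-regularized OT (Sinkhorn iterations) with a cosine-distance cost; the marginals, regularization strength, and embedding setup here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def sinkhorn(cost, row_marg, col_marg, eps=0.05, n_iters=100):
    """Entropic-regularized OT plan via Sinkhorn scaling (illustrative)."""
    K = np.exp(-cost / eps)             # Gibbs kernel from the cost matrix
    u = np.ones_like(row_marg)
    for _ in range(n_iters):
        v = col_marg / (K.T @ u)        # enforce column (class) marginal
        u = row_marg / (K @ v)          # enforce row (sample) marginal
    return u[:, None] * K * v[None, :]  # transport plan

rng = np.random.default_rng(0)
B, C, d = 8, 3, 16                      # batch size, num classes, embed dim

# Toy "text anchors" and noisy "image embeddings", L2-normalized as in CLIP.
anchors = rng.normal(size=(C, d))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])
images = anchors[labels] + 0.1 * rng.normal(size=(B, d))
images /= np.linalg.norm(images, axis=1, keepdims=True)

# Cosine distance as transport cost; uniform marginals over batch and classes.
cost = 1.0 - images @ anchors.T
plan = sinkhorn(cost, np.full(B, 1.0 / B), np.full(C, 1.0 / C))

# Batch-wise pseudo-labels: each sample's dominant anchor in the plan.
pseudo = plan.argmax(axis=1)
```

Compared with per-sample argmax over similarities, the class-marginal constraint spreads assignments across anchors, which is what makes the assignment batch-wise rather than independent per sample.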