Keywords: test time, visual concepts
Abstract: Large vision-language models such as CLIP are widely deployed under conditions that differ from pre-training, causing visual patch tokens to drift from the semantic regions expected by the text-aligned head.
We propose *test-time concept anchoring* (TTCA), a training-free module that treats the visual tokens of a test image as a source measure and a task-conditioned bank of text concepts as a target measure, then solves an entropic optimal transport problem to softly project selected tokens toward semantic anchors before the downstream head consumes them.
TTCA operates per sample, requires no backpropagation, and admits an unbalanced variant with a reject sink for open-set noise.
On CLIP ViT-B/16, TTCA improves zero-shot accuracy on CIFAR-100 by $1.0$\%, improves mean accuracy across 9 corruption types (89\% of individual conditions improved), reduces distractor-induced accuracy degradation by 41\%, and improves CIFAR-100 calibration, all at roughly 4 ms per image with **no** model parameter changes.
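The core idea, entropic optimal transport from visual tokens to text concepts followed by a soft projection, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The cosine cost, the uniform marginals, the barycentric projection, the mixing weight `alpha`, and the function names `sinkhorn_plan` and `anchor_tokens` are all assumptions made for illustration. Token selection and the unbalanced variant with a reject sink are not shown.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT plan between marginals a (rows) and b (columns) via Sinkhorn iterations."""
    K = np.exp(-cost / eps)           # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)               # rescale to match the row marginal
        v = b / (K.T @ u)             # rescale to match the column marginal
    return u[:, None] * K * v[None, :]

def anchor_tokens(tokens, concepts, eps=0.1, alpha=0.5, n_iters=200):
    """Softly pull each visual token toward its transport-weighted mix of concepts.

    Assumed (hypothetical) design: cosine cost, uniform marginals, and a convex
    combination controlled by alpha. No gradients or parameter updates are used.
    """
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    c = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
    cost = 1.0 - t @ c.T              # cosine distance, values in [0, 2]
    n, m = cost.shape
    a = np.full(n, 1.0 / n)           # source measure over visual tokens
    b = np.full(m, 1.0 / m)           # target measure over text concepts
    plan = sinkhorn_plan(cost, a, b, eps, n_iters)
    # Barycentric projection: normalize each row of the plan, then average the anchors.
    bary = (plan / plan.sum(axis=1, keepdims=True)) @ c
    return (1.0 - alpha) * t + alpha * bary, plan

# Usage with random stand-ins for CLIP patch tokens and text-concept embeddings.
rng = np.random.default_rng(0)
anchored, plan = anchor_tokens(rng.normal(size=(6, 16)), rng.normal(size=(4, 16)))
```

Setting `alpha=0` leaves the normalized tokens unchanged, while `alpha=1` replaces each token with its concept barycenter. A smaller `eps` gives a sharper, closer to one-to-one assignment at the cost of slower Sinkhorn convergence.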
Submission Number: 40