Keywords: Model Calibration, Prompt Tuning, Train-Time Prompt Tuning, VLM, Long-Tail
TL;DR: Calibrating train-time prompt tuning of vision-language models
Abstract: Prompt tuning of large-scale vision-language models such as CLIP enables efficient
task adaptation without updating model weights. However, it often leads to poor
confidence calibration and a degradation of the semantic structure in the text
embedding space. We propose a novel calibrated prompt tuning framework that
jointly preserves the geometric properties of pre-trained CLIP embeddings while
improving predictive reliability. We augment the standard cross-entropy
loss with two complementary regularizers: (1) a mean-plus-variance margin
penalty that stabilizes interclass logit margins by jointly maximizing their average
while minimizing their dispersion, mitigating underconfidence on base classes and
overconfidence on novel ones; and (2) a text moment-matching loss that aligns
the first and second moments of the learned class text embeddings with those of
the frozen CLIP text features, preserving semantic dispersion. Through extensive
experiments across multiple datasets and prompt variants, we demonstrate that our
approach significantly reduces the Expected Calibration Error (ECE) compared to
competitive calibration techniques. Our code and models will be made publicly
available.
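As a rough illustration, below is a minimal PyTorch sketch of how the two regularizers described in the abstract might be implemented. The function names, the `lam`, `alpha`, and `beta` weights, and the use of covariance matrices for second-moment matching are our own assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def margin_mean_variance_penalty(logits, labels, lam=1.0):
    """Sketch of the mean-plus-variance margin penalty (assumed form)."""
    # Inter-class margins: true-class logit minus every other class logit.
    true_logit = logits.gather(1, labels.unsqueeze(1))        # (B, 1)
    margins = true_logit - logits                             # (B, C)
    # Drop the zero margin at the true-class position.
    mask = torch.ones_like(margins, dtype=torch.bool)
    mask.scatter_(1, labels.unsqueeze(1), False)
    margins = margins[mask].view(logits.size(0), -1)          # (B, C-1)
    # Maximize the average margin while minimizing its dispersion.
    return -margins.mean() + lam * margins.var()

def text_moment_matching_loss(learned_txt, frozen_txt):
    """Sketch of the text moment-matching loss (assumed form)."""
    # First moments: means of the class text embeddings.
    mean_loss = F.mse_loss(learned_txt.mean(dim=0), frozen_txt.mean(dim=0))
    # Second moments: feature covariance across classes.
    def cov(x):
        xc = x - x.mean(dim=0, keepdim=True)
        return xc.t() @ xc / (x.size(0) - 1)
    cov_loss = (cov(learned_txt) - cov(frozen_txt)).pow(2).mean()
    return mean_loss + cov_loss

# Hypothetical combined training objective:
# loss = F.cross_entropy(logits, labels) \
#        + alpha * margin_mean_variance_penalty(logits, labels) \
#        + beta * text_moment_matching_loss(learned_txt, frozen_txt)
```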
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9403