TOWARDS CALIBRATING PROMPT TUNING OF VISION-LANGUAGE MODELS

17 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Model Calibration, Prompt Tuning, Train-Time Prompt Tuning, VLM, Long-Tail
TL;DR: Calibrating train-time prompt tuning of vision-language models
Abstract: Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and a degradation of the semantic structure of the text embedding space. We propose a novel framework for calibrating prompt tuning that preserves the geometric properties of pre-trained CLIP embeddings while improving predictive reliability. We augment the standard cross-entropy loss with two complementary regularizers: (1) a mean-plus-variance margin penalty that stabilizes inter-class logit margins by jointly maximizing their average and minimizing their dispersion, mitigating underconfidence on base classes and overconfidence on novel ones; and (2) a text moment-matching loss that aligns the first and second moments of the learned class text embeddings with those of the frozen CLIP text features, preserving semantic dispersion. Through extensive experiments across multiple datasets and prompt-tuning variants, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques. Our code and models will be made publicly available.
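The abstract describes the two regularizers only in words; since the authors' code is not yet released, the following is a minimal PyTorch-style sketch of how such terms could look. All function names, the exact margin and moment definitions, and the loss weights are assumptions for illustration, not the paper's implementation.

```python
import torch

def margin_penalty(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical mean-plus-variance margin penalty.

    For each sample, the margin is the gap between the true-class logit and
    every other class logit. The penalty rewards a large average margin while
    penalizing its dispersion: -mean(margins) + var(margins).
    """
    true_logit = logits.gather(1, labels.unsqueeze(1))        # (B, 1)
    mask = torch.ones_like(logits, dtype=torch.bool)
    mask.scatter_(1, labels.unsqueeze(1), False)              # drop true-class entries
    margins = (true_logit - logits)[mask]                     # all inter-class margins
    return -margins.mean() + margins.var()

def text_moment_matching(learned_txt: torch.Tensor,
                         frozen_txt: torch.Tensor) -> torch.Tensor:
    """Hypothetical text moment-matching loss.

    Aligns the first moment (mean) and second moment (uncentered covariance)
    of the learned class text embeddings, shape (C, D), with those of the
    frozen CLIP text features.
    """
    mu_l, mu_f = learned_txt.mean(0), frozen_txt.mean(0)
    m2_l = learned_txt.T @ learned_txt / learned_txt.shape[0]
    m2_f = frozen_txt.T @ frozen_txt / frozen_txt.shape[0]
    return (mu_l - mu_f).pow(2).sum() + (m2_l - m2_f).pow(2).sum()

# Hypothetical total objective (lambda1, lambda2 are assumed hyperparameters):
# loss = F.cross_entropy(logits, labels) \
#        + lambda1 * margin_penalty(logits, labels) \
#        + lambda2 * text_moment_matching(learned_txt, frozen_txt)
```

Under this reading, penalizing the variance of the margins discourages both collapsed and extreme logit gaps (the respective drivers of underconfidence on base classes and overconfidence on novel ones), while the moment term anchors the tuned text embeddings to the frozen CLIP distribution rather than to individual feature vectors.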
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9403