Keywords: Model Calibration, Prompt Tuning, Train-Time Prompt Tuning, VLM, Long-Tail
TL;DR: Calibrating train-time prompt tuning of vision-language models
Abstract: Prompt tuning of large-scale vision-language models such as CLIP enables efficient
task adaptation without updating model weights. However, it often leads to poor
confidence calibration and a degradation of the semantic structure in the text
embedding space. We propose a novel calibrated prompt tuning framework that
jointly preserves the geometric properties of pre-trained CLIP embeddings while
improving predictive reliability. We augment the standard cross-entropy
loss with two complementary regularizers: (1) a mean-plus-variance margin
penalty that stabilizes interclass logit margins by jointly maximizing their average
while minimizing their dispersion, mitigating underconfidence on base classes and
overconfidence on novel ones; and (2) a text moment-matching loss that aligns
the first and second moments of the learned class text embeddings with those of
the frozen CLIP text features, preserving semantic dispersion. Through extensive
experiments across multiple datasets and prompt variants, we demonstrate that our
approach significantly reduces the Expected Calibration Error (ECE) compared to
competitive calibration techniques. Our code and models will be made publicly
available.
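As a rough illustration, below is a minimal PyTorch sketch of how the two regularizers described in the abstract might be implemented. The function names, the `lam`, `alpha`, and `beta` weights, and the use of covariance matrices for second-moment matching are our own assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def margin_mean_variance_penalty(logits, labels, lam=1.0):
    """Sketch of the mean-plus-variance margin penalty (assumed form)."""
    # Inter-class margins: true-class logit minus every other class logit.
    true_logit = logits.gather(1, labels.unsqueeze(1))        # (B, 1)
    margins = true_logit - logits                             # (B, C)
    # Drop the zero margin at the true-class position.
    mask = torch.ones_like(margins, dtype=torch.bool)
    mask.scatter_(1, labels.unsqueeze(1), False)
    margins = margins[mask].view(logits.size(0), -1)          # (B, C-1)
    # Maximize the average margin while minimizing its dispersion.
    return -margins.mean() + lam * margins.var()

def text_moment_matching_loss(learned_txt, frozen_txt):
    """Sketch of the text moment-matching loss (assumed form)."""
    # First moments: means of the class text embeddings.
    mean_loss = F.mse_loss(learned_txt.mean(dim=0), frozen_txt.mean(dim=0))
    # Second moments: feature covariance across classes.
    def cov(x):
        xc = x - x.mean(dim=0, keepdim=True)
        return xc.t() @ xc / (x.size(0) - 1)
    cov_loss = (cov(learned_txt) - cov(frozen_txt)).pow(2).mean()
    return mean_loss + cov_loss

# Hypothetical combined training objective:
# loss = F.cross_entropy(logits, labels) \
#        + alpha * margin_mean_variance_penalty(logits, labels) \
#        + beta * text_moment_matching_loss(learned_txt, frozen_txt)
```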
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9403