Keywords: Vision-language models, Prompt learning, Confidence calibration, Contrast metric
Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated exceptional generalization capabilities and can quickly adapt to downstream tasks through prompt tuning. Unfortunately, in classification tasks involving classes not seen during training, fine-tuned VLMs often overfit to the training classes, resulting in a misalignment between confidence scores and actual accuracy on unseen classes, which significantly undermines their reliability in real-world deployments. Existing confidence calibration methods typically require training parameters or analyzing features from the training dataset, restricting their ability to generalize to unseen classes for which no training data is available. Moreover, VLM-specific calibration methods rely solely on text features from training classes as calibration indicators, which inherently limits their ability to calibrate training classes and to handle other evaluation settings, such as cross-dataset and domain-generalization settings. To address these challenges, we propose a multimodal calibration method, $\textbf{Contrast-Aware Calibration (CAC)}$. Building on the original CLIP's zero-shot adaptability and on our empirical finding that poor intra-class and inter-class discriminative ability on unseen classes is the root cause of miscalibration, we calculate calibration weights based on the contrastive difference between the original and fine-tuned CLIP. This method is not only effective for calibrating unseen classes but also overcomes the limitations of previous VLM calibration methods, which struggle to calibrate training classes and other settings. In experiments across multiple settings with 5 fine-tuning methods, CAC achieves strong calibration in all settings without sacrificing accuracy.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8719