Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · CC BY 4.0
TL;DR: We propose Dynamic Outlier Regularization to improve fine-tuned CLIP’s calibration under various evaluations without compromising vanilla fine-tuning.
Abstract: Confidence calibration is critical for the safe deployment of machine learning models in the real world. However, this issue has not been fully addressed in vision-language models like CLIP, particularly after fine-tuning. In this work, we demonstrate that existing prompt tuning methods usually lead to a calibration trade-off between base and new classes: the cross-entropy loss used in standard fine-tuning (e.g., CoOp) causes overconfidence in new classes by increasing textual label divergence, whereas regularization-based tuning (e.g., KgCoOp) maintains the confidence level but results in underconfidence in base classes due to the improved accuracy. Inspired by these observations, we introduce Dynamic Outlier Regularization (DOR) to ensure confidence calibration on both base and new classes after fine-tuning. In particular, we propose to minimize the feature deviation of novel textual labels (instead of base classes) sampled from a large vocabulary. In effect, DOR prevents the increase in textual divergence for new labels while easing restrictions on base classes. Extensive experiments demonstrate that DOR can enhance the calibration performance of current fine-tuning methods on both base and new classes.
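The sketch below illustrates the training objective implied by the abstract, assuming a PyTorch-style setup: the standard cross-entropy loss on base classes is combined with a regularizer that keeps the tuned text features of outlier labels, sampled from a large vocabulary, close to their frozen zero-shot CLIP counterparts. All names (dor_loss, lam, tensor shapes) are illustrative assumptions, not the authors' released implementation; see the linked repository for the official code.

```python
# Minimal, hypothetical sketch of a DOR-style objective: cross-entropy on base
# classes plus a penalty on the feature deviation of outlier textual labels.
# Names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn.functional as F

def dor_loss(logits, targets, tuned_outlier_feats, frozen_outlier_feats, lam=1.0):
    """Combine base-class cross-entropy with an outlier feature-deviation regularizer.

    logits:               (B, C) image-text similarity scores for the base classes
    targets:              (B,)   ground-truth base-class indices
    tuned_outlier_feats:  (K, D) text features of K outlier labels under the tuned prompt,
                                 where the labels are (re-)sampled from a large vocabulary
    frozen_outlier_feats: (K, D) text features of the same labels from frozen zero-shot CLIP
    lam:                  weight of the regularization term (assumed hyperparameter)
    """
    ce = F.cross_entropy(logits, targets)
    # Penalize how far the tuned outlier text features drift from their zero-shot anchors.
    deviation = (1.0 - F.cosine_similarity(tuned_outlier_feats,
                                           frozen_outlier_feats, dim=-1)).mean()
    return ce + lam * deviation

# Toy usage with random tensors standing in for CLIP image/text encodings.
if __name__ == "__main__":
    B, C, K, D = 8, 10, 64, 512
    logits = torch.randn(B, C)
    targets = torch.randint(0, C, (B,))
    tuned = torch.randn(K, D)
    frozen = torch.randn(K, D)
    print(dor_loss(logits, targets, tuned, frozen).item())
```

In this reading, the regularizer constrains only the sampled outlier labels rather than the base classes, which matches the abstract's claim that DOR limits textual divergence for new labels while easing restrictions on base classes.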
Lay Summary: When foundation vision-language models like CLIP are fine-tuned for specific tasks, they can sometimes become too sure or not sure enough about their predictions. This can make them less dependable for important real-world uses, such as healthcare or autonomous vehicles, where knowing how confident a model should be is key to safety. Our study found that current fine-tuning methods either make the model overconfident in new classes or not confident enough in familiar ones. To solve this, we propose a new method called Dynamic Outlier Regularization (DOR). DOR carefully balances how the model handles both new and familiar classes, ensuring it stays reliable without being overly certain. Our experiments show that DOR makes CLIP’s predictions more accurate and trustworthy for real-world applications.
Link To Code: https://github.com/ml-stat-Sustech/Outlier-Calibration
Primary Area: Social Aspects->Robustness
Keywords: Vision-Language Models, CLIP, Confidence Calibration, Fine-tuning
Submission Number: 5881