Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. To address this gap, we present a comprehensive uncertainty benchmarking study based on conformal prediction, evaluating 16 state-of-the-art VLMs (both open-source and proprietary) across 6 multimodal datasets with 3 distinct scoring functions. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification: models that know more also know better what they don't know. We further find that more certain models achieve higher accuracy, and that mathematical and reasoning tasks elicit poorer uncertainty performance across all models than other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
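For context, a minimal sketch of the split conformal prediction procedure that underlies this kind of benchmark. The abstract does not name the 3 scoring functions evaluated, so this sketch assumes the common LAC nonconformity score s(x, y) = 1 - softmax_y(x); the function name, array shapes, and alpha value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with the common LAC score
    s(x, y) = 1 - softmax_y(x). Illustrative sketch, not the
    paper's exact method.

    cal_probs:  (n, K) softmax outputs on a held-out calibration set
    cal_labels: (n,)   true labels for the calibration set
    test_probs: (m, K) softmax outputs on the test set
    alpha:      target miscoverage; returned sets contain the true
                label with probability >= 1 - alpha
    """
    n = len(cal_labels)
    # Nonconformity score of each calibration example's true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected (1 - alpha) quantile of the scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, q_level, method="higher")
    # A class y enters the prediction set when 1 - p_y <= qhat,
    # i.e. p_y >= 1 - qhat.
    return test_probs >= 1.0 - qhat  # (m, K) boolean membership mask
```

Under this setup every model meets the same 1 - alpha coverage guarantee by construction, so the average prediction-set size becomes the natural comparison metric: smaller sets at equal coverage indicate better-quantified uncertainty.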
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Uncertainty Quantification, Vision Language Models, Conformal Prediction
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 7857