Uncertainty Quantification for Multimodal Large Language Models with Coherence-adjusted Semantic Volume
Keywords: Uncertainty quantification, Multimodal LLMs, MLLM
TL;DR: We propose a training-free, modality-agnostic framework to estimate the uncertainty of MLLM outputs for tasks involving multimodal input, generalizable across various modalities.
Abstract: Multimodal Large Language Models (MLLMs) hold promise for tackling tasks comprising multiple input modalities, but may produce seemingly plausible yet erroneous output, making them hard to trust and deploy. Accurate uncertainty metrics at inference time could enable efficient escalation of queries from MLLMs to human experts or larger models for improved performance. However, existing uncertainty metrics are designed and tested only for specific modalities, and require external verifiers, additional training, or high computational resources, while struggling to handle scenarios such as out-of-distribution (OOD) or adversarial settings. To overcome these limitations, we propose UMPIRE, a training-free framework that estimates MLLM uncertainty for tasks involving various input modalities at inference time without external tools, based on the diversity of the MLLM's responses, measured by the semantic volume they enclose and adjusted with internal indicators of each response's coherence. UMPIRE does not require external modality-specific interventions and instead relies on the MLLM's own internal modality features, allowing it to generalize across modalities. We provide theoretical analysis to offer intuition on how UMPIRE could satisfy key desiderata, and empirically show that it outperforms baselines in predicting incorrect responses and providing calibrated uncertainty estimates across input modality tasks involving text, image, and video, including OOD, adversarial, and domain-specific data settings. We also show that UMPIRE performs well for uncertainty quantification on generation tasks beyond text, such as image and audio generation.
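The coherence-adjusted semantic volume described in the abstract can be illustrated with a minimal sketch, assuming the volume is computed as the log-determinant of a Gram matrix of coherence-weighted response embeddings; the function name, embedding normalization, and coherence weighting below are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a coherence-adjusted semantic volume.
# All names and the exact formulation are assumptions for illustration only.
import numpy as np

def coherence_adjusted_semantic_volume(embeddings, coherence_scores, eps=1e-6):
    """Estimate response diversity as the log-volume spanned by sampled
    response embeddings, each scaled by an internal coherence indicator
    (e.g., a mean token log-probability mapped to [0, 1])."""
    E = np.asarray(embeddings, dtype=float)        # shape: (n_responses, dim)
    w = np.asarray(coherence_scores, dtype=float)  # shape: (n_responses,)
    E = E / (np.linalg.norm(E, axis=1, keepdims=True) + eps)  # unit-normalize
    E = E * w[:, None]                             # down-weight incoherent responses
    gram = E @ E.T                                 # pairwise similarity (Gram) matrix
    # Log-determinant of the regularized Gram matrix as a proxy for enclosed volume;
    # larger values indicate more semantically diverse (i.e., more uncertain) responses.
    _, logdet = np.linalg.slogdet(gram + eps * np.eye(len(E)))
    return logdet

# Usage (assumed workflow): sample k responses from the MLLM, embed and score
# their coherence, then treat a larger volume as higher uncertainty when
# deciding whether to escalate a query to a human expert or a larger model.
```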
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24000