Uncertainty Quantification for Multimodal Large Language Models with Coherence-adjusted Semantic Volume
Keywords: Uncertainty quantification, Multimodal LLMs, MLLM
TL;DR: We propose a training-free, modality-agnostic framework to estimate the uncertainty of MLLM outputs for tasks involving multimodal input, generalizable across various modalities.
Abstract: Multimodal Large Language Models (MLLMs) hold promise for tackling tasks comprising multiple input modalities, but may produce seemingly plausible yet erroneous output, making them hard to trust and deploy. Accurate uncertainty metrics at inference time could enable efficient escalation of queries from MLLMs to human experts or larger models for improved performance. However, existing uncertainty metrics are designed and tested only for specific modalities, and require external verifiers, additional training, or high computational resources, while struggling to handle scenarios such as out-of-distribution (OOD) or adversarial settings. To overcome these limitations, we propose UMPIRE, a training-free framework that estimates MLLM uncertainty for tasks involving various input modalities at inference time without external tools, based on the diversity of the MLLM's responses, measured by the semantic volume they enclose and adjusted with internal indicators of each response's coherence. UMPIRE does not require external modality-specific interventions and instead relies on the MLLM's own internal modality features, allowing it to generalize across modalities. We provide theoretical analysis to offer intuition on how UMPIRE could satisfy key desiderata, and empirically show that it outperforms baselines in predicting incorrect responses and providing calibrated uncertainty estimates across input modality tasks involving text, image, and video, including OOD, adversarial, and domain-specific data settings. We also show that UMPIRE performs well for uncertainty quantification on generation tasks beyond text, such as image and audio generation.
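The coherence-adjusted semantic volume described in the abstract can be illustrated with a minimal sketch, assuming the volume is computed as the log-determinant of a Gram matrix of coherence-weighted response embeddings; the function name, embedding normalization, and coherence weighting below are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a coherence-adjusted semantic volume.
# All names and the exact formulation are assumptions for illustration only.
import numpy as np

def coherence_adjusted_semantic_volume(embeddings, coherence_scores, eps=1e-6):
    """Estimate response diversity as the log-volume spanned by sampled
    response embeddings, each scaled by an internal coherence indicator
    (e.g., a mean token log-probability mapped to [0, 1])."""
    E = np.asarray(embeddings, dtype=float)        # shape: (n_responses, dim)
    w = np.asarray(coherence_scores, dtype=float)  # shape: (n_responses,)
    E = E / (np.linalg.norm(E, axis=1, keepdims=True) + eps)  # unit-normalize
    E = E * w[:, None]                             # down-weight incoherent responses
    gram = E @ E.T                                 # pairwise similarity (Gram) matrix
    # Log-determinant of the regularized Gram matrix as a proxy for enclosed volume;
    # larger values indicate more semantically diverse (i.e., more uncertain) responses.
    _, logdet = np.linalg.slogdet(gram + eps * np.eye(len(E)))
    return logdet

# Usage (assumed workflow): sample k responses from the MLLM, embed and score
# their coherence, then treat a larger volume as higher uncertainty when
# deciding whether to escalate a query to a human expert or a larger model.
```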
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24000