Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification

Published: 26 Jan 2026, Last Modified: 27 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Misbehavior Detection, Large Vision-Language Model, Evidential Theory, Uncertainty
Abstract: Large vision-language models (LVLMs) have achieved substantial advances in multimodal understanding. However, when presented with challenging or distribution-shifted inputs, they frequently produce unreliable or even harmful content, such as hallucinations or toxic responses. We refer to such misalignments with human expectations as misbehaviors of LVLMs, which raise serious concerns for their deployment in critical applications. Existing research has shown that such misbehaviors are closely linked to model uncertainty. We find that they primarily stem from two distinct sources of epistemic uncertainty: internal contradictions (conflict) and the absence of supporting information (ignorance). Existing uncertainty quantification methods typically capture only total predictive uncertainty and therefore struggle to distinguish between these underlying causes. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a training-free framework that explicitly decomposes epistemic uncertainty into conflict (CF) and ignorance (IG). Specifically, we interpret features from the model's output head as either supporting (positive) or opposing (negative) evidence. Leveraging the Dempster-Shafer theory of belief functions, we aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate EUQ across four misbehavior categories, namely hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures, using state-of-the-art LVLMs. Experimental results demonstrate that EUQ consistently outperforms strong baselines, achieving relative improvements of up to 10.5% in AUROC. Our evaluation further reveals that hallucinations correspond to high internal conflict and OOD failures to high ignorance. Furthermore, a layer-wise analysis of evidential uncertainty dynamics provides a novel perspective on the evolution of internal representations. The source code is available at https://github.com/HT86159/EUQ.
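To make the evidential machinery concrete, below is a minimal sketch of how positive and negative evidence could be fused with Dempster's rule on a binary frame, reading off conflict (CF) as the mass assigned to contradictory intersections and ignorance (IG) as the residual mass on the full frame. The evidence-to-mass mapping, the +1 prior, and the sequential fusion are illustrative assumptions, not the paper's actual EUQ implementation; see the linked repository for that.

```python
import numpy as np

def mass_from_evidence(e_pos, e_neg):
    """Map scalar positive/negative evidence to a mass function over the
    binary frame {support, oppose} plus the full frame Theta.
    The +1 prior keeps residual mass on Theta (ignorance); this mapping is
    an illustrative choice, not necessarily the paper's."""
    total = e_pos + e_neg + 1.0
    return np.array([e_pos / total, e_neg / total, 1.0 / total])  # [m_S, m_O, m_Theta]

def dempster_combine(m1, m2):
    """Dempster's rule on the binary frame: returns the normalized combined
    mass function and the conflict mass K (support vs. oppose clashes)."""
    k = m1[0] * m2[1] + m1[1] * m2[0]                    # contradictory intersections
    m_s = m1[0] * m2[0] + m1[0] * m2[2] + m1[2] * m2[0]  # both point to support
    m_o = m1[1] * m2[1] + m1[1] * m2[2] + m1[2] * m2[1]  # both point to oppose
    m_t = m1[2] * m2[2]                                  # both remain ignorant
    combined = np.array([m_s, m_o, m_t]) / (1.0 - k)     # renormalize
    return combined, k

def euq_scores(pos_evidence, neg_evidence):
    """Fold per-feature evidence pairs into conflict (CF) and ignorance (IG)."""
    masses = [mass_from_evidence(p, n) for p, n in zip(pos_evidence, neg_evidence)]
    combined, total_conflict = masses[0], 0.0
    for m in masses[1:]:
        combined, k = dempster_combine(combined, m)
        total_conflict += k
    cf = total_conflict / max(len(masses) - 1, 1)  # mean per-step clash
    ig = combined[2]                               # residual mass on Theta
    return cf, ig

# Contradictory evidence yields high CF; scarce evidence yields high IG.
print(euq_scores([5.0, 0.1], [0.1, 5.0]))  # strong disagreement -> high conflict
print(euq_scores([0.1, 0.2], [0.1, 0.1]))  # little evidence either way -> high ignorance
```

Running the two examples shows the intended separation: the first pair of features pushes hard in opposite directions and most of the product mass lands on the empty set (high CF), while the second pair carries so little evidence that most mass stays on the full frame (high IG), mirroring the hallucination-vs-OOD pattern the abstract reports.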
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15301