Detecting Unreliable Responses in Generative Vision-Language Models via Visual Uncertainty

Published: 05 Mar 2025, Last Modified: 01 Apr 2025 · Poster · CC BY 4.0
Keywords: Multimodal Uncertainty Estimation, Visual Uncertainty
Abstract: Building trust in vision-language models (VLMs) requires reliable uncertainty estimation (UE) to detect unreliable generations. Existing UE approaches often require access to internal model representations to train an uncertainty estimator, which is not always feasible. Black-box methods primarily rely on language-based augmentations, such as question rephrasings or sub-question modules, to detect unreliable generations; the role of visual information in UE remains largely underexplored. To study this aspect of the UE research problem, we investigate a visual contrast approach that perturbs input images by removing the visual evidence relevant to the question and measures the resulting change in the output distribution. We hypothesize that for unreliable generations, the output token distributions from the augmented and unaugmented images remain similar despite the removal of key visual information in the augmented image. We evaluate this method on the A-OKVQA dataset using four popular pre-trained VLMs. Our results demonstrate that visual contrast, even when applied only at the first token, can match, and in some cases exceed, existing state-of-the-art probability-based black-box methods.
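The first-token variant of the idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the distance measure (KL divergence here) and the toy probability vectors are assumptions; in practice the two distributions would be the VLM's first-token output probabilities for the original image and for the image with question-relevant evidence masked out.

```python
import numpy as np

def first_token_contrast(p_orig, p_masked, eps=1e-12):
    """KL divergence KL(p_orig || p_masked) between the first-token
    distributions produced from the original and the evidence-masked
    image. A low score means the output barely changed when the key
    visual evidence was removed, flagging the generation as unreliable."""
    p = np.asarray(p_orig, dtype=float) + eps
    q = np.asarray(p_masked, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy first-token distributions (hypothetical numbers, not from any model):
p_reliable = [0.7, 0.2, 0.1]    # confident answer with the full image
q_reliable = [0.2, 0.5, 0.3]    # distribution shifts once evidence is masked

p_unreliable = [0.4, 0.3, 0.3]
q_unreliable = [0.4, 0.3, 0.3]  # unchanged after masking -> likely unreliable

score_reliable = first_token_contrast(p_reliable, q_reliable)
score_unreliable = first_token_contrast(p_unreliable, q_unreliable)
```

A higher contrast score indicates the model actually used the removed visual evidence; generations whose score falls below some threshold would be flagged as unreliable.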
Submission Number: 36