Keywords: Vision language model; Uncertainty; Multi-modality
Abstract: Recent advances in vision language models (VLMs), such as GPT-4o, have revolutionized visual reasoning by enabling zero-shot task completion through natural language instructions. In this paper, we study VLMs' ability to detect input ambiguities, i.e., aleatoric uncertainty. Our key finding is that VLMs can effectively identify ambiguous inputs simply by including an instruction to output "Unknown" when uncertain. Through experiments on corrupted ImageNet and out-of-distribution (OOD) detection tasks, we demonstrate that VLMs successfully reject uncertain inputs while maintaining high accuracy on confident predictions. This capacity for implicit uncertainty quantification emerges without additional training or in-context learning, distinguishing VLMs from traditional vision models, which often produce overconfident predictions on ambiguous inputs.
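The prompt-based abstention the abstract describes can be sketched as follows. This is a minimal illustration assuming access to the OpenAI Python SDK and the GPT-4o endpoint; the prompt wording, the `classify_with_abstention` helper, and its parameters are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of prompting a VLM to abstain on ambiguous inputs.
# Assumes the OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment;
# the prompt text and helper name are hypothetical, not the paper's verbatim setup.
import base64
from openai import OpenAI

client = OpenAI()

def classify_with_abstention(image_path: str, labels: list[str]) -> str:
    """Ask a VLM to pick a label, or to answer "Unknown" when the input is ambiguous."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Classify this image as one of: "
                            + ", ".join(labels)
                            + '. If you are uncertain, answer "Unknown".'
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content.strip()

# Example: on a heavily corrupted ImageNet image, the model may answer "Unknown"
# rather than committing to an overconfident label.
# print(classify_with_abstention("corrupted_sample.jpg", ["cat", "dog", "bird"]))
```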
Submission Number: 31