Keywords: Multimodal Large Language Models, Uncertainty Evaluation, Visual Question Answering, Distillation
Abstract: While Multimodal Large Language Models (MLLMs) demonstrate significant potential in addressing complex multimodal tasks, they often produce plausible yet incorrect responses that limit their practical deployment, highlighting the critical need for reliable uncertainty evaluation. Existing metrics for assessing model uncertainty typically require extensive labeled datasets and rely on token-level confidence, which can be inadequate for open-ended multimodal tasks. To address these issues, we propose an Uncertainty-Aware Self-Assessment Framework (UnSAF), which explicitly incorporates the key question "Do MLLMs know what they don't know?" into the evaluation procedure. Specifically, UnSAF first prompts MLLMs to generate a set of both answerable and unanswerable questions, then requires the models to answer these self-generated questions. The responses are categorized into four distinct types, namely true answerable, false answerable, true unanswerable, and false unanswerable, which ultimately yields an interpretable and label-free uncertainty-aware F1 (UnF1) score. Using UnSAF, we conduct extensive studies across both open-source and commercial MLLMs. Our experiments not only demonstrate the effectiveness of UnSAF compared to conventional metrics but also reveal intriguing observations. Notably, we identify a clear positive correlation between the UnF1 score and model scale, which motivates the use of knowledge distillation to enhance uncertainty awareness in open-source, smaller-scale MLLMs. Rather than simply transferring question-answering ability from larger models, we incorporate uncertainty-aware question generation into the distillation framework by teaching the student model to generate both answerable and unanswerable questions in response to different types of instructions.
Experiments show that distilling this uncertainty-aware question-generation capability markedly enhances MLLMs' uncertainty awareness without degrading original task performance, while also noticeably reducing hallucinations.
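As an illustration of the scoring step, the sketch below computes an F1-style score from the four response categories named in the abstract, treating "answerable" as the positive class. This is a hypothetical instantiation for clarity only; the paper's exact UnF1 definition is not given in this abstract and may differ.

```python
def unf1_score(true_ans: int, false_ans: int,
               true_unans: int, false_unans: int) -> float:
    """Hypothetical UnF1 sketch: a standard F1 over the four response
    categories, with 'answerable' as the positive class.
    (Assumed formulation; the paper's actual definition may differ.)"""
    # Precision: of responses the model treated as answerable, how many were truly answerable.
    denom_p = true_ans + false_ans
    precision = true_ans / denom_p if denom_p else 0.0
    # Recall: of truly answerable questions, how many the model answered.
    denom_r = true_ans + false_unans
    recall = true_ans / denom_r if denom_r else 0.0
    denom_f = precision + recall
    return 2 * precision * recall / denom_f if denom_f else 0.0
```

For example, a model with 8 true-answerable, 2 false-answerable, 7 true-unanswerable, and 3 false-unanswerable responses would score 16/21 ≈ 0.762 under this assumed formulation.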
Supplementary Material: zip
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 2810