Human Uncertainty-Aware Reliable Data Selection and Efficient Annotation for Visual Question Answering

Published: 29 Sept 2025, Last Modified: 12 Oct 2025 · NeurIPS 2025 - Reliable ML Workshop · CC BY 4.0
Keywords: Large Vision-Language Model, Visual Question Answering, Human Uncertainty, Imperfect and Unreliable Data
Abstract: Large vision-language models (VLMs) achieve strong performance but still depend on supervised fine-tuning (SFT) with massive annotated datasets, which are both costly and inherently noisy due to human annotation, especially when human uncertainty exists. We find that the degree of human uncertainty affects the reliability of a sample, casting doubt on its suitability for SFT. It remains unknown how to use human uncertainty for training when such imperfect data exist. Moreover, current mainstream SFT methods simply require annotation of the full dataset, incurring unnecessary annotation overhead. In this work, we revisit Visual Question Answering (VQA), one of the most important and commonly studied tasks for VLMs, and study data reliability and label efficiency in this setting. To this end, we propose a $\textbf{h}$uman $\textbf{u}$ncertainty-aware $\textbf{r}$eliable data selection and efficient label $\textbf{a}$nnotation method (HURA). HURA's advantages are twofold: first, it filters out harmful samples and prioritizes more reliable samples that indeed improve model performance (both accuracy and human alignment), reducing computational costs; second, it does not require extensive human annotation of the entire dataset, reducing annotation costs and avoiding potential manual noise. We find that training with only a small random subset ($\textasciitilde10$%) of the data already recovers most of the full-data performance ($\textasciitilde90$%), and that not all samples are equally reliable for improving model performance: high human-uncertainty samples contribute little or even harm training, while medium- and low-human-uncertainty samples provide larger improvements. We also find that models can self-train from a provided seed set, further reducing annotation reliance and cost. Our experimental results demonstrate that HURA is effective for recent state-of-the-art VQA models on the VQAv2 dataset. HURA highlights an important direction for learning reliably from imperfect data: understanding and leveraging uncertainty, rather than simply scaling up the size of the training data.
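To make the selection idea concrete, below is a minimal sketch (not the authors' implementation) of human uncertainty-aware subset selection on VQAv2-style data, where each question comes with ~10 human answers. The disagreement proxy, the field name `answers`, the uncertainty threshold `high`, and the 10% budget are all assumptions for illustration; the paper's actual uncertainty measure and selection rule may differ.

```python
from collections import Counter

def human_uncertainty(answers):
    """Proxy for human uncertainty on one sample: 1 minus the fraction of
    annotators agreeing on the majority answer (0.0 = full agreement,
    approaching 1.0 = full disagreement). VQAv2 gives ~10 answers per question."""
    counts = Counter(a.strip().lower() for a in answers)
    majority_frac = max(counts.values()) / len(answers)
    return 1.0 - majority_frac

def select_reliable_subset(samples, high=0.6, budget=0.10):
    """Drop high human-uncertainty samples and keep a small (~10%) subset of the
    lowest-uncertainty ones, mirroring the paper's finding that medium- and
    low-uncertainty samples drive most of the improvement."""
    scored = sorted((human_uncertainty(s["answers"]), s) for s in samples)
    kept = [s for u, s in scored if u <= high]          # filter harmful samples
    return kept[: max(1, int(budget * len(samples)))]   # cap at the annotation budget

# Usage (hypothetical data layout):
# subset = select_reliable_subset(vqa_train_samples)
# then run SFT of the VLM on `subset` instead of the full dataset.
```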
Submission Number: 133