Abstract: We propose a novel VQA dataset, BloomVQA,
to facilitate comprehensive evaluation of large
vision-language models on comprehension
tasks. Unlike current benchmarks that often
focus on fact-based memorization and simple
reasoning tasks without theoretical grounding,
we collect multiple-choice samples based on
picture stories that reflect different levels of
comprehension, as laid out in Bloom’s Taxonomy, a classic framework for learning assessment widely adopted in education research.
Our data maps to a novel hierarchical graph representation that enables automatic data augmentation and novel measures characterizing model consistency. We perform graded
evaluation and reliability analysis on recent
multi-modal models. In comparison to low-level tasks, we observe decreased performance on tasks requiring advanced comprehension and cognitive skills, with up to a 38.0% drop in VQA accuracy. In comparison to earlier models, GPT-4V demonstrates improved accuracy across all comprehension levels and shows a tendency to bypass visual inputs, especially for
higher-level tasks. Current models also show
consistency patterns misaligned with human
comprehension in various scenarios, demonstrating the need for improvement based on
theoretically-grounded criteria. The dataset
can be accessed at https://huggingface.co/datasets/ygong/BloomVQA.