Keywords: benchmark, evaluation, VQA, comics
Abstract: We introduce Comic Visual Question Answering (\textbf{ComicVQA}), a comics-based benchmark for evaluating MLLMs on visual reasoning. ComicVQA comprises (i) \textbf{Missing Panel Prediction}, which tests fine-grained visual grounding, and (ii) \textbf{Panel Sorting}, which evaluates sequential narrative understanding. Proprietary models achieve up to 62.6\% on Missing Panel Prediction and 46.4\% on Panel Sorting, whereas open-source models reach only 47.7\% and 26.9\%, respectively. In contrast, human annotators achieve over 83\% accuracy on both tasks, revealing a large gap between current models and human-level multimodal understanding in comics. Through controlled ordering ablations and a detailed error taxonomy, we show that current MLLMs rely primarily on coarse temporal cues and struggle with fine-grained visual reasoning. These findings establish ComicVQA as a diagnostic benchmark for advancing multimodal visual reasoning in comics.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, automatic evaluation of datasets, visual question answering
Contribution Types: Data resources
Languages Studied: English
Submission Number: 9534