Abstract: Large Multimodal Models (LMMs) have demonstrated strong performance on vision-language benchmarks, yet current evaluations predominantly focus on single-image reasoning.
In contrast, real-world scenarios often require understanding sequences of images. A typical example is comic strip understanding, which requires models to perform nuanced visual reasoning beyond surface-level recognition.
To address this gap, we introduce STRIPCIPHER, a benchmark designed to evaluate models' ability to understand implicit narratives in silent comics. STRIPCIPHER is a high-quality, human-annotated dataset featuring fine-grained annotations and comprehensive coverage of varying difficulty levels.
It comprises three tasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering.
Notably, evaluation results on STRIPCIPHER reveal a significant gap between current LMMs and human performance---e.g., GPT-4o achieves only 23.93\% accuracy on the reordering task, 56.07\% below human performance.
These findings underscore the limitations of current LMMs in implicit visual narrative understanding and highlight opportunities for advancing sequential multimodal reasoning.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodality
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: language
Submission Number: 5387