Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?

ACL ARR 2025 February Submission 6746 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Multimodal Models (LMMs) have demonstrated strong performance across a range of vision-language tasks. However, existing benchmarks primarily evaluate LMMs' understanding of individual images, leaving their ability to analyze image sequences largely unexplored. To address this gap, we introduce StripCipher, a comprehensive benchmark designed to assess LMMs' capabilities in understanding and reasoning over sequential images. StripCipher consists of a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Evaluating 16 prominent LMMs on StripCipher, including GPT-4o and Qwen2.5VL, we identify a significant performance gap between current LMMs and humans in understanding temporal and contextual relationships in image sequences. For instance, GPT-4o achieves only $23.93\%$ accuracy on the reordering subtask, $56.07$ percentage points below human performance. Further quantitative analysis reveals several key factors affecting LMMs' performance in sequential understanding, underscoring the fundamental challenges that remain in the development of LMMs.
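As a concrete illustration of the kind of evaluation the reordering subtask implies, the sketch below shows one way sequence-level exact-match accuracy could be computed. This is a minimal, hypothetical example: the abstract does not specify the benchmark's data format or scoring code, and the names `ReorderingExample`, `frames`, `gold_order`, and `exact_match_accuracy` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of scoring the temporal narrative reordering subtask.
# Data fields and metric details are assumptions; the paper's actual format may differ.

from dataclasses import dataclass
from typing import List


@dataclass
class ReorderingExample:
    frames: List[str]       # shuffled frame identifiers shown to the model
    gold_order: List[int]   # human-annotated correct ordering (indices into frames)


def exact_match_accuracy(examples: List[ReorderingExample],
                         predictions: List[List[int]]) -> float:
    """Fraction of sequences whose predicted ordering matches the gold ordering exactly."""
    if not examples:
        return 0.0
    correct = sum(ex.gold_order == pred for ex, pred in zip(examples, predictions))
    return correct / len(examples)


# Example: a single 4-frame strip, predicted in the correct order.
examples = [ReorderingExample(frames=["f3", "f1", "f4", "f2"], gold_order=[1, 3, 0, 2])]
print(exact_match_accuracy(examples, predictions=[[1, 3, 0, 2]]))  # 1.0
```

Under an exact-match criterion like this, partially correct orderings receive no credit, which is one plausible reason reported reordering accuracies fall well below human performance.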
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodality
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: language
Submission Number: 6746