Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

ACL ARR 2025 February Submission3976 Authors

15 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: This paper introduces TempVS, a new and challenging benchmark that focuses on the temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering, and image ordering), each accompanied by a basic grounding test, yielding a total of 2,085 annotated image sequences and 15k multiple-choice questions. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We extensively evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS. Our analysis reveals a substantial performance gap between current MLLMs and human capabilities, accompanied by fine-grained insights that suggest promising directions for future research.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodal Large Language Models, Multi-image Understanding, Benchmark and Dataset, Vision and Language Learning
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 3976