CMTReorder: What is the Right Timeline of these Cross-Modal Fragments?

ACL ARR 2025 February Submission 6872 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Timeline reordering is a crucial task in time series reasoning, where events must be sorted along a temporal axis across various formats. While recent advancements in multimodal large language models (MLLMs) have shown promise in single-modal temporal reasoning, real-world data is often mixed and unstructured, with modalities existing independently and without clear pairings. To address this gap, we introduce a novel task, Cross-Modal Timeline Reordering (**CMTReorder**), which evaluates the cross-modal temporal reasoning ability of MLLMs. The task consists of two tests: Cross-modal Direct Ordering, where models reorder the timeline directly, and Cross-modal Binary Decision, where models first make binary decisions on temporal relationships before reordering. We also present the MixStoryLine dataset, which includes text and image narratives from different time points. We evaluate CMTReorder using multiple MLLMs, including GPT-4o, LLaMA, and DeepSeek. The results reveal significant challenges: GPT-4o achieves 24% consistent accuracy in direct ordering, 66.88% accuracy in binary judgment, and 9% consistent accuracy in the subsequent reordering, with other models performing worse. These findings highlight the difficulty of cross-modal temporal inference and underscore the need for further improvements in model performance, while also offering insights for real-world applications.
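As a rough illustration of how the two tests might be scored, the sketch below computes an exact-match "consistent accuracy" over full reorderings and a plain accuracy over pairwise binary judgments. The function names, the exact-match reading of "consistent accuracy", and the toy fragment IDs are assumptions for illustration only; the paper's actual scoring protocol may differ.

```python
from typing import Sequence


def consistent_accuracy(predicted: Sequence[Sequence[str]],
                        gold: Sequence[Sequence[str]]) -> float:
    """Fraction of timelines whose predicted ordering matches the gold
    ordering exactly (one hypothetical reading of 'consistent accuracy')."""
    assert len(predicted) == len(gold)
    exact = sum(list(p) == list(g) for p, g in zip(predicted, gold))
    return exact / len(gold) if gold else 0.0


def binary_accuracy(predicted_pairs: Sequence[bool],
                    gold_pairs: Sequence[bool]) -> float:
    """Accuracy over pairwise before/after judgments, as in the
    Cross-modal Binary Decision test."""
    assert len(predicted_pairs) == len(gold_pairs)
    correct = sum(p == g for p, g in zip(predicted_pairs, gold_pairs))
    return correct / len(gold_pairs) if gold_pairs else 0.0


if __name__ == "__main__":
    # Toy example: two timelines of mixed text/image fragment IDs (hypothetical).
    gold = [["img_1", "txt_2", "img_3"], ["txt_1", "txt_2", "img_2"]]
    pred = [["img_1", "txt_2", "img_3"], ["txt_2", "txt_1", "img_2"]]
    print(consistent_accuracy(pred, gold))  # 0.5: one of two timelines is fully correct
```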
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, vision question answering, cross-modal information extraction, cross-modal application
Contribution Types: Data analysis, Position papers, Theory
Languages Studied: English
Submission Number: 6872