Can Multimodal Foundation Models Perform Visual Temporal Reasoning?

ICLR 2025 Conference Submission 13155 Authors

28 Sept 2024 (modified: 25 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: visual temporal reasoning, video understanding, benchmark, vision-language benchmark, video-language models, evaluation
Abstract: Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, *how well do these models truly perform visual temporal reasoning*? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated, as many questions can be solved using a single frame, a few frames, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) *Multi-Frame Gain*, (2) *Frame Order Sensitivity*, and (3) *Frame Information Disparity*. Following these principles, we introduce **TVBench** (**T**emporal Reasoning **V**ideo Understanding **Bench**mark), a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TVBench comprises 1,484 carefully curated, *human-annotated* questions spanning six tasks (i.e., *action count, direction, rotation, shape & trend, velocity & frequency, and visual cues*) over 1,417 videos, 805 of which are self-recorded or self-generated, covering human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a 57.3% performance gap between humans and the best-performing model. Moreover, our in-depth analysis uncovers fundamental limitations in current MFMs that go beyond this gap: while they can accurately recognize events in isolated frames, they fail to interpret those frames as a continuous sequence. We believe TVBench will serve as a crucial testbed for evaluating next-generation MFMs and as a call to the community to develop AI systems capable of comprehending the dynamics of the human world through the video modality.
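The abstract names the three diagnostic metrics but does not define them. As a rough illustration only, the minimal sketch below shows one plausible way such metrics could be computed from per-question correctness under different frame inputs; the function names, signatures, and formulas are assumptions for illustration, not the submission's actual definitions.

```python
# Illustrative sketch only: plausible computations for the three principles
# named in the abstract. The exact definitions in the submission may differ;
# all function names and formulas here are assumptions.
from statistics import mean

def multi_frame_gain(correct_multi_frame, correct_single_frame):
    """Assumed: accuracy improvement from seeing multiple frames
    versus a single frame, averaged over questions (0/1 correctness)."""
    return mean(correct_multi_frame) - mean(correct_single_frame)

def frame_order_sensitivity(correct_ordered, correct_shuffled):
    """Assumed: accuracy drop when the input frames are shuffled,
    i.e. how much the model relies on correct temporal order."""
    return mean(correct_ordered) - mean(correct_shuffled)

def frame_information_disparity(per_frame_accuracy):
    """Assumed: spread of single-frame accuracy across frame positions;
    0 means every frame is equally informative for answering."""
    return max(per_frame_accuracy) - min(per_frame_accuracy)

# Toy usage with made-up correctness values:
print(multi_frame_gain([1, 1, 0, 1], [0, 1, 0, 0]))         # 0.5
print(frame_order_sensitivity([1, 1, 1, 0], [1, 0, 0, 0]))  # 0.5
print(frame_information_disparity([0.2, 0.5, 0.8]))         # ~0.6
```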
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13155