TemporalBench: Towards Fine-grained Temporal Understanding for Multimodal Video Models

18 Sept 2024 (modified: 14 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: video, benchmark, multimodal
TL;DR: We introduce a fine-grained multimodal video understanding benchmark.
Abstract: Understanding fine-grained temporal dynamics is crucial for video understanding. Yet popular video benchmarks, such as MSRVTT and TGIF, often fail to effectively evaluate AI models' temporal reasoning abilities due to their lack of fine-grained temporal annotations. As a result, text-based models, leveraging strong language priors, often perform comparably to video models, and image-trained models have been reported to outperform their video-trained counterparts on MSRVTT and TGIF. This paper introduces TemporalBench, a new benchmark for fine-grained temporal event understanding in videos. TemporalBench, sourced from a diverse set of video datasets, consists of $\sim$10K video question-answer pairs derived from $\sim$2K high-quality human-annotated video captions. Uniquely, our benchmark provides fine-grained temporal annotations to evaluate models' temporal reasoning abilities. Our results show that state-of-the-art models like GPT-4o achieve only 38.0\% multiple binary QA accuracy on TemporalBench, demonstrating a significant human-AI gap in temporal understanding. We hope that TemporalBench will be instrumental in fostering research on improving models' temporal reasoning capabilities.
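To make the reported metric concrete, below is a minimal sketch of how a "multiple binary QA accuracy" could be computed, under the assumption (not spelled out in the abstract) that each caption yields several binary questions and an item is credited only when all of its questions are answered correctly. The function name and data layout are illustrative, not the benchmark's released evaluation code.

```python
from collections import defaultdict

def multiple_binary_qa_accuracy(preds, golds, item_ids):
    """Fraction of items whose binary questions are ALL answered correctly.

    preds, golds : lists of 0/1 answers, one entry per binary question.
    item_ids     : identifier of the video/caption each question belongs to.
    """
    # Assumption: an item counts as correct only if every one of its
    # binary questions is answered correctly.
    all_correct = defaultdict(lambda: True)
    for p, g, item in zip(preds, golds, item_ids):
        all_correct[item] = all_correct[item] and (p == g)
    return sum(all_correct.values()) / len(all_correct)

# Example: item "v1" has two questions, one answered wrong, so it gets no credit.
print(multiple_binary_qa_accuracy(
    preds=[1, 0, 1], golds=[1, 1, 1], item_ids=["v1", "v1", "v2"]))  # 0.5
```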
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1635