Keywords: video, benchmark, multimodal
TL;DR: We introduce a fine-grained multimodal video understanding benchmark
Abstract: Understanding fine-grained temporal dynamics is crucial for video understanding. Yet, popular video benchmarks, such as MSRVTT and TGIF, often fail to effectively evaluate AI models' temporal reasoning abilities due to the lack of fine-grained temporal annotations.
As a result, text-based models, leveraging strong language priors, often perform comparably to video models, and image-trained models have been reported to outperform their video-trained counterparts on MSRVTT and TGIF. This paper introduces TemporalBench, a new benchmark for fine-grained temporal event understanding in videos. TemporalBench, sourced from a diverse set of video datasets, consists of $\sim$10K video description question pairs, derived from $\sim$2K high-quality human-annotated video captions. Uniquely, our benchmark provides fine-grained temporal annotations to evaluate models' temporal reasoning abilities. Our results show that state-of-the-art models like GPT-4o achieve only 38.0\% multiple binary QA accuracy on TemporalBench, demonstrating a significant human-AI gap in temporal understanding. We hope that TemporalBench is instrumental in fostering research on improving models' temporal reasoning capabilities.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1635