TemporalBench: Evaluating Fine-Grained Temporal Dynamics Understanding for Multimodal Models

01 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Video; Benchmark; Temporal
TL;DR: TemporalBench: Evaluating Fine-Grained Temporal Dynamics Understanding for Multimodal Models
Abstract: Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are insufficient for evaluating models' temporal understanding. In this paper, we introduce *TemporalBench*, a benchmark dedicated to evaluating **fine-grained temporal understanding** in videos. *TemporalBench* consists of $\sim$15K video question-answer pairs, derived from $\sim$2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as *action frequency, motion magnitude, and event order*. Moreover, it supports evaluation on both short and long video understanding tasks, as well as on different model types, including multimodal embedding models and text generation models. Furthermore, we identify a critical pitfall in multi-choice QA: LLMs can detect the subtle changes in negative captions and exploit a "centralized" description as a cue for their predictions. To correct this bias, we propose **Multiple Binary Accuracy (MBA)**, a new metric for dense temporal understanding. Results show that state-of-the-art models such as GPT-4o achieve only **38.5%** accuracy on short video QA, demonstrating a significant gap ($\sim$30%) between humans and AI in temporal understanding. We hope that *TemporalBench* can foster research on improving models' temporal reasoning capabilities. Both the dataset and code will be available.
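To make the MBA idea concrete, below is a minimal, hypothetical sketch of how such a metric could be scored. It assumes each question pairs one positive caption with several subtly altered negatives, and that a question counts as correct only when the model picks the positive caption in every pairwise binary comparison; the field names (`video`, `positive`, `negatives`) and the `choose` callback are illustrative assumptions, not the paper's actual data schema or API.

```python
# Hypothetical sketch of a Multiple Binary Accuracy (MBA) style metric.
# Assumes: one positive caption plus several negative captions per question,
# and a question is credited only if every binary comparison is answered correctly.

from typing import Callable, Sequence


def multiple_binary_accuracy(
    questions: Sequence[dict],
    choose: Callable[[str, str, str], int],
) -> float:
    """Fraction of questions where the model prefers the positive caption
    over *every* negative caption in pairwise binary choices.

    Each question dict is assumed (for illustration) to hold:
      - "video":     an identifier or path for the video clip
      - "positive":  the ground-truth fine-grained caption
      - "negatives": a list of subtly altered negative captions
    `choose(video, caption_a, caption_b)` returns the index (0 or 1) of the
    caption the model selects as the correct description of the video.
    """
    if not questions:
        return 0.0

    correct = 0
    for q in questions:
        # Credit the question only if all positive-vs-negative pairs are won.
        all_pairs_correct = all(
            choose(q["video"], q["positive"], neg) == 0
            for neg in q["negatives"]
        )
        correct += int(all_pairs_correct)
    return correct / len(questions)
```

In practice one would also randomize the order of the two captions in each comparison so that a fixed positive slot cannot leak positional cues; the sketch omits that detail for brevity.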
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 588