MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Published: 01 Jan 2024, Last Modified: 16 Feb 2025 · CVPR 2024 · CC BY-SA 4.0
Abstract: With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
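The abstract notes that scoring against ground-truth multiple-choice annotations avoids the bias of LLM-based judging. As a minimal sketch of that idea (not the authors' code, and the field names `question`, `options`, and `answer` are assumptions rather than MVBench's actual schema), exact-match accuracy over option letters can be computed deterministically:

```python
# Hypothetical sketch of ground-truth multiple-choice scoring,
# illustrating evaluation without an LLM judge.
from dataclasses import dataclass


@dataclass
class MCQAItem:
    question: str
    options: list[str]   # e.g. ["(A) ...", "(B) ...", "(C) ...", "(D) ..."]
    answer: str          # ground-truth option letter, e.g. "B"


def score(items: list[MCQAItem], predictions: list[str]) -> float:
    """Exact-match accuracy of predicted option letters against ground truth."""
    if not items:
        return 0.0
    correct = sum(
        pred.strip().upper() == item.answer.strip().upper()
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)


# Usage (hypothetical): accuracy = score(benchmark_items, model_choices)
```

Because the answer key comes from existing video annotations, the score is reproducible and does not depend on a judge model's preferences.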