MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Yaning Pan; Zekun Moore Wang; Xie Qianqian; Yongqian Wen; Yuanxing Zhang; Guohui Zhang; 胡皓轩; Zhiyu Pan; Yibing Huang; Zhidong Gan; Yonghong Lin; An Ping; Tianhao Peng; Jiaheng Liu

01 Sept 2025 (modified: 12 Nov 2025); ICLR 2026 Conference Withdrawn Submission; CC BY 4.0

Keywords: Multimodal Large Language Models, Multi-turn Video Dialogue, Benchmark Evaluation

Abstract: The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI’s ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, and it comprises 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing significant performance discrepancies among them and their limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

Primary Area: datasets and benchmarks

Submission Number: 427
