Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

ACL ARR 2024 June Submission650 Authors

12 Jun 2024 (modified: 03 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Video-adapted large language models (Video-LLMs) are pivotal for advancing artificial general intelligence (AGI) in video understanding. Despite progress, existing methods rarely undergo comprehensive assessment from an AGI-construction perspective. We propose that an ideal video intelligence model should possess three essential abilities: (i) Video-exclusive Understanding, crucial for tasks like event summarization where direct analysis of the video content is paramount; (ii) Prior Knowledge-based Question-Answering, essential for applications that require contextual insight, such as in-depth sports analysis or cultural understanding of music videos and television shows; and (iii) Comprehension and Decision-making, vital for predictive tasks in complex environments such as 3D scene navigation or autonomous vehicle guidance. To systematically evaluate these abilities, we introduce Video-Bench, an ability-oriented benchmark encompassing real-world video data and meticulously designed QA pairs, accompanied by an automated evaluation toolkit. Our analysis of 8 leading Video-LLMs shows a significant gap in achieving human-like video understanding, underscoring the need for further advances toward video comprehension AGI.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: vision language navigation; vision question answering; cross-modal application; video processing; multimodality
Languages Studied: English
Submission Number: 650