InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding

Kirolos Ataallah; Chenhui Gou; Eslam Mohamed BAKR; Khushbu Pahwa; Jian Ding; Mohamed Elhoseiny

InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding

Kirolos Ataallah, Chenhui Gou, Eslam Mohamed BAKR, Khushbu Pahwa, Jian Ding, Mohamed Elhoseiny

20 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0

Keywords: video understanding, benchmark, long video benchmark, long video understanding

Abstract: Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench a comprehensive benchmark for very long video understanding which presents 1)very long video duration, averaging 52.59 minutes per video 2)The largest number of question-answer pairs, 108.2K 3) Diversity in questions that examine nine different skills and include both multiple-choice questions and open-ended questions 4) Memory questions, such as Global Appearance that require remembering and tracking the visual aspects through the video. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including the commercial models such as GPT-4o and Gemini 1.5 Flash and the recent open-source models. The evaluation shows significant challenges in our benchmark. Our findings reveal that even leading AI models like GPT-4o and Gemini 1.5 Flash face challenges in achieving high performance in long video understanding, with average accuracies of just 56.01 % and 43.32 %, and average scores of 3.25 and 2.79 out of 5, respectively. Qwen2-VL matches Gemini's performance in the MCQ skills but lags significantly in open-ended question tasks. We hope this benchmark will stimulate the LMMs community towards long video and human-level understanding.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 2250

Loading