Keywords: MLLMs, Video, Evaluations, Benchmark, VQA, Video Understanding
TL;DR: The paper presents a new VQA evaluation benchmark with two tasks, used to evaluate multiple teams' systems under controlled conditions.
Abstract: Recent advancements in large multi-modal models have significantly improved AI's ability to process and understand complex data across multiple modalities, including text, images, and video. However, true comprehension of video content remains a formidable challenge: it requires AI systems to integrate visual, auditory, and temporal information to answer questions in a meaningful way. In this paper, we present a new Video Question Answering (VQA) evaluation benchmark that aims to rigorously assess the capabilities of state-of-the-art multi-modal models in understanding and reasoning about video content. Participating teams developed and tested models that answer a diverse set of questions about video clips, covering various levels of complexity, from factual retrieval to complex reasoning. The benchmark serves as an evaluation framework for measuring progress in video understanding, helping to identify strengths and weaknesses in current multi-modal AI architectures. Its main advantages are high-quality human annotations produced by dedicated, trained in-house annotators, the use of real-world, in-the-wild data, and a shared-task paradigm under controlled conditions that evaluates multiple systems fairly. The benchmark has completed its first pilot year, which included two sub-tasks: an Answer Generation task and a Multiple Choice task. We plan to continue running the benchmark annually, adding new data sources, refining metrics, and adding new question categories.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 24