Keywords: Omni models, Benchmark, Multimodality
Abstract: In this paper, we introduce OmniEval, a benchmark for evaluating multimodal Chinese and English video understanding that encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, OmniEval has several distinctive features: (i) Full-modal collaboration: we design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 1,000 audio-visual synchronized videos, with 307 Chinese videos and 558 English videos, systematically categorized into four major domains; (iii) Diversity and granularity of tasks: OmniEval contains 2,783 question-answer pairs, comprising 1,412 open-ended questions and 1,371 multiple-choice questions, divided into four major task types and 12 subtask types to achieve comprehensive evaluation. Among these, we introduce a more fine-grained video localization task, named Grounding. Based on OmniEval, we extensively evaluate a variety of state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding the real world, with the best accuracy being only 10%. We hope that OmniEval can provide a platform for evaluating the ability to construct and understand coherence from the context of all modalities.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 16567