Keywords: Omni models, Benchmark, Multimodality
Abstract: In this paper, we introduce OmniEval, a benchmark for evaluating multimodal Chinese and English video understanding that encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, OmniEval has several distinctive features: (i) Full-modal collaboration: we design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 1,000 audio-visual synchronized videos, with 307 Chinese videos and 558 English videos, systematically categorized into four major domains; (iii) Diversity and granularity of tasks: OmniEval contains 2,783 question-answer pairs, comprising 1,412 open-ended questions and 1,371 multiple-choice questions, divided into four major task types and 12 subtask types to achieve comprehensive evaluation. Among these, we introduce a more fine-grained video localization task, named Grounding. Based on OmniEval, we extensively evaluate a variety of state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding the real world, with the best accuracy being only 10%. We hope that OmniEval can provide a platform for evaluating the ability to construct and understand coherence from the context of all modalities.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 16567