Abstract: We introduce MAVERIX (Multimodal Audiovisual Evaluation and Recognition IndeX), a unified benchmark for probing video understanding in multimodal LLMs, encompassing video, audio, and text inputs alongside human performance baselines. Although models with vision and audio understanding capabilities have made substantial progress, the field lacks a standardized evaluation framework to thoroughly assess their cross-modal comprehension. MAVERIX curates 2,556 questions from 700 videos, in both multiple-choice and open-ended formats, explicitly designed to evaluate multimodal models through questions that require tight integration of video and audio information, spanning a broad spectrum of agentic scenarios. MAVERIX uniquely presents models with audiovisual questions, closely mimicking the multimodal perceptual experience available to humans during inference and decision making. To our knowledge, MAVERIX is the first benchmark explicitly aimed at assessing comprehensive audiovisual integration at this granularity. Experiments with state-of-the-art models, including Qwen 2.5 Omni and Gemini 2.5 Flash-Lite, show accuracy around 64%, while human experts reach near-ceiling performance of 92.8%, exposing a substantial gap to human-level comprehension. With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.