AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

09 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: multimodal benchmark, audiovisual intelligence, multimodal large language models
TL;DR: Benchmarking Human-like Audio-Visual Intelligence of Omni-MLLMs
Abstract: Recent breakthroughs in Omni-Multimodal Large Language Models (Omni-MLLMs), such as GPT-4o, have showcased remarkable progress in integrating visual and audio modalities with language, bringing us closer to human-like audio-visual intelligence. However, a critical gap remains: the lack of systematic benchmarks to rigorously evaluate these models’ audio-visual capabilities. Existing evaluations are often fragmented, focusing on isolated tasks and overlooking the multifaceted nature of audio-visual intelligence. To address this, we introduce \textbf{AVI-Bench}, a cognitively inspired benchmark designed to assess Omni-MLLMs across three stages: perception, understanding, and reasoning. Each stage comprises cross-modal tasks that require simultaneous interpretation of visual and audio inputs, enabling fine-grained diagnostics of model strengths and weaknesses. To further probe models’ robustness to unfamiliar sensory inputs, we propose \textbf{AVI-Bench-PriSe}, an extension targeting the ``primitive sensation'' of Omni-MLLMs on unfamiliar-domain audio-visual inputs with low-semantic stimuli, thereby testing their generalization beyond commonly used general-domain training data. Through comprehensive experiments on both open- and closed-source models, AVI-Bench reveals critical limitations and bottlenecks of current Omni-MLLMs. Building on these insights, we present a four-level taxonomy for classifying audio-visual intelligence. Our work provides the community with a principled evaluation framework that not only benchmarks performance but also guides future development toward more robust, adaptive, and human-aligned audio-visual intelligence.
Primary Area: datasets and benchmarks
Submission Number: 3278