AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

09 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: multimodal benchmark, audiovisual intelligence, multimodal large language models
TL;DR: Benchmarking Human-like Audio-Visual Intelligence of Omni-MLLMs
Abstract: Recent breakthroughs in Omni-Multimodal Large Language Models (Omni-MLLMs), such as GPT-4o, have showcased remarkable progress in integrating visual and audio modalities with language, bringing us closer to human-like audio-visual intelligence. However, a critical gap remains: the lack of systematic benchmarks to rigorously evaluate these models’ audio-visual capabilities. Existing evaluations are often fragmented, focusing on isolated tasks and overlooking the multifaceted nature of audio-visual intelligence. To address this, we introduce \textbf{AVI-Bench}, a cognitively inspired benchmark designed to assess Omni-MLLMs across three stages: perception, understanding, and reasoning. Each stage comprises cross-modal tasks that require simultaneous interpretation of visual and audio inputs, enabling fine-grained diagnostics of model strengths and weaknesses. To further probe models’ robustness to unfamiliar sensory inputs, we propose \textbf{AVI-Bench-PriSe}, an extension targeting the ``primitive sensation'' of Omni-MLLMs on unfamiliar-domain audio-visual inputs with low-semantic stimuli, thereby testing their generalization beyond commonly used general-domain training data. Through comprehensive experiments on both open- and closed-source models, AVI-Bench reveals critical limitations and bottlenecks of current Omni-MLLMs. Building on these insights, we present a four-level taxonomy for classifying audio-visual intelligence. Our work provides the community with a principled evaluation framework that not only benchmarks performance but also guides future development toward more robust, adaptive, and human-aligned audio-visual intelligence.
Primary Area: datasets and benchmarks
Submission Number: 3278