AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | Everyone | CC BY 4.0
TL;DR: Benchmarking Human-like Audio-Visual Intelligence of Omni-MLLMs
Abstract: Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated because systematic, comprehensive benchmarks are lacking. We introduce **AVI-Bench**, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages (perception, understanding, and reasoning) through cross-modal tasks that require joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose **AVI-Bench-PriSe**, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a **four-level AVI taxonomy**. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench
Lay Summary: When humans watch a movie, the brain naturally integrates visual and auditory information. However, the capability of Omni-MLLMs, AI systems designed to jointly process images, audio, and text, to achieve similar audio-visual integration remains poorly understood. To address this limitation, we introduce AVI-Bench, a benchmark suite structured according to three levels of human cognition: perception, understanding, and reasoning. AVI-Bench-PriSe further extends the benchmark with unfamiliar and abstract stimuli to evaluate primitive sensation capabilities. Extensive experiments on leading open-source and closed-source models reveal substantial deficiencies in current audio-visual intelligence. Based on these findings, we propose a four-level taxonomy of audio-visual intelligence, ranging from primitive sensation to human-like reasoning, which provides a unified evaluation framework for future research.
Link To Code: https://fudancvl.github.io/AVI-Bench
Primary Area: Deep Learning->Large Language Models
Keywords: multimodal benchmark, audiovisual intelligence, multimodal large language models
Submission Number: 5610