Keywords: Audio-Visual Understanding, Multi-modal Large Language Models
Abstract: Recent multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5/2.5 Pro, and Reka Core, have advanced audio-visual reasoning capabilities, achieving strong performance in tasks like cross-modal understanding and generation. However, our \textbf{DeafTest} uncovers unanticipated failures: most state-of-the-art MLLMs struggle with very simple audio tasks, such as \textit{distinguishing louder sounds} or \textit{sound counting}. This raises a fundamental question: does a deficiency in low-level audio perception constrain higher-level audio-visual reasoning? To address this, we introduce \textbf{AV-Odyssey Bench}, a comprehensive benchmark of 4,555 meticulously designed problems that integrate text, audio, and visual modalities. Each task requires models to reason across modalities, leveraging synchronized audio-visual cues to infer the answer. By structuring questions as multiple-choice, we ensure objective, reproducible evaluation without reliance on subjective human or LLM-based judgments. Through comprehensive benchmarking of closed-source and open-source models, we show that (i) current MLLMs lack robust audio-visual integration abilities, and (ii) performance on DeafTest strongly correlates with AV-Odyssey accuracy (Pearson's $r = 0.945$). These findings not only challenge prevailing assumptions about the “multimodal proficiency” of leading models, but also identify fundamental audio perception as a bottleneck for audio-visual reasoning. We believe our results provide concrete guidance for future dataset design, alignment strategies, and architectures toward truly integrated audio-visual understanding.
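For reference, the reported correlation between DeafTest and AV-Odyssey performance is a standard Pearson correlation over per-model accuracies. The sketch below is only illustrative: the scores are hypothetical placeholders, not results reported in the paper, and the exact aggregation used by the authors may differ.

```python
# Minimal sketch: Pearson correlation between per-model DeafTest accuracy
# and AV-Odyssey accuracy. All scores below are hypothetical placeholders,
# not values reported in the paper.
from scipy.stats import pearsonr

deaftest_acc  = [0.52, 0.61, 0.48, 0.70, 0.55]   # hypothetical DeafTest accuracies
avodyssey_acc = [0.31, 0.38, 0.29, 0.44, 0.33]   # hypothetical AV-Odyssey accuracies

r, p_value = pearsonr(deaftest_acc, avodyssey_acc)
print(f"Pearson's r = {r:.3f} (p = {p_value:.3g})")
```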
Primary Area: datasets and benchmarks
Submission Number: 12268