Keywords: Audio-Visual Understanding, Multi-modal Large Language Models
Abstract: Recent multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5/2.5 Pro, and Reka Core, have advanced audio-visual reasoning capabilities, achieving strong performance in tasks like cross-modal understanding and generation. However, our \textbf{DeafTest} uncovers unanticipated failures: most state-of-the-art MLLMs struggle with very simple audio tasks, such as \textit{distinguishing louder sounds} or \textit{sound counting}. This raises a fundamental question: does a deficiency in low-level audio perception constrain higher-level audio-visual reasoning? To address this, we introduce \textbf{AV-Odyssey Bench}, a comprehensive benchmark of 4,555 meticulously designed problems that integrate text, audio, and visual modalities. Each task requires models to reason across modalities, leveraging synchronized audio-visual cues to infer the answer. By structuring questions as multiple-choice, we ensure objective, reproducible evaluation without reliance on subjective human or LLM-based judgments. Through comprehensive benchmarking of closed-source and open-source models, we show that (i) current MLLMs lack robust audio-visual integration abilities, and (ii) performance on DeafTest strongly correlates with AV-Odyssey accuracy (Pearson's $r = 0.945$). These findings not only challenge prevailing assumptions about the “multimodal proficiency” of leading models, but also identify fundamental audio perception as a bottleneck for audio-visual reasoning. We believe our results provide concrete guidance for future dataset design, alignment strategies, and architectures toward truly integrated audio-visual understanding.
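For reference, the reported correlation between DeafTest and AV-Odyssey performance is a standard Pearson correlation over per-model accuracies. The sketch below is only illustrative: the scores are hypothetical placeholders, not results reported in the paper, and the exact aggregation used by the authors may differ.

```python
# Minimal sketch: Pearson correlation between per-model DeafTest accuracy
# and AV-Odyssey accuracy. All scores below are hypothetical placeholders,
# not values reported in the paper.
from scipy.stats import pearsonr

deaftest_acc  = [0.52, 0.61, 0.48, 0.70, 0.55]   # hypothetical DeafTest accuracies
avodyssey_acc = [0.31, 0.38, 0.29, 0.44, 0.33]   # hypothetical AV-Odyssey accuracies

r, p_value = pearsonr(deaftest_acc, avodyssey_acc)
print(f"Pearson's r = {r:.3f} (p = {p_value:.3g})")
```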
Primary Area: datasets and benchmarks
Submission Number: 12268