MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Published: 18 Sept 2025 · Last Modified: 30 Oct 2025 · NeurIPS 2025 Datasets and Benchmarks Track (poster) · CC BY 4.0
Keywords: Benchmark, Audio Understanding, Audio Reasoning, Large Audio Language Model
TL;DR: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across a massive set of multi-disciplinary tasks.
Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across a massive set of multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error correction and quality checks to ensure high quality. Unlike existing benchmarks limited to specific domains of sound, music, or speech, MMAR covers a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding, and some questions require graduate-level perceptual and domain-specific knowledge, raising the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), and, with audio captions as input, Large Language Models (LLMs) and Large Reasoning Models (LRMs). The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis reveals critical limitations in the understanding and reasoning capabilities of current models. These findings underscore the urgent need for greater research attention in audio-language reasoning, spanning both data and algorithm innovation. We hope MMAR will serve as a catalyst for future advances in this important but under-explored area.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/BoJack/MMAR
Code URL: https://github.com/ddlBoJack/MMAR
Primary Area: Applications of Datasets & Benchmarks in speech and audio
Submission Number: 34
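
Getting started: the snippet below is a minimal sketch of how one might load the benchmark with the Hugging Face `datasets` library. The split name and record fields (`question`, `choices`, `answer`) are assumptions for illustration; the dataset card at the Dataset URL above documents the actual schema.

```python
# Minimal sketch of loading MMAR from the Hugging Face Hub.
# The split name and field names below are illustrative assumptions;
# see https://huggingface.co/datasets/BoJack/MMAR for the actual schema.
from datasets import load_dataset

mmar = load_dataset("BoJack/MMAR", split="test")  # split name is an assumption

for item in mmar.select(range(3)):
    # Each record is expected to pair an audio clip with a question,
    # answer choices, the correct answer, and a CoT rationale.
    print(item.get("question"), item.get("choices"), item.get("answer"))
```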