MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

Yadong Niu; TIANZI WANG; Heinrich Dinkel; Xingwei Sun; Jiahao Zhou; Gang Li; Jizhong Liu; Xunying Liu; Jian Luan

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | Everyone | CC BY 4.0

TL;DR: To effectively distinguish between vague and precise outputs from large audio-language models, we constructed the MECAT benchmark, which comprises a novel, fine-grained annotated dataset and its accompanying discriminative evaluation metric.

Abstract: While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The code and data are publicly available at https://github.com/xiaomi-research/mecat.
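As a rough illustration of how a score might combine single-sample semantic similarity with cross-sample discriminability, the sketch below works on toy embedding vectors. It is not the DATE formula from the paper: the function names, the rank-based discriminability term, and the harmonic-mean combination are all assumptions made for this example, and the official implementation is in the linked repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def date_like_scores(candidates, references):
    """Hypothetical per-sample score; candidates[i] is paired with references[i].

    Assumption: similarity is the cosine to the paired reference, and
    discriminability is the fraction of *other* references that score
    strictly lower than the paired one for this candidate.
    """
    n = len(candidates)
    scores = []
    for i, cand in enumerate(candidates):
        sims = [cosine(cand, ref) for ref in references]
        own = max(sims[i], 0.0)
        # Count other references that match this candidate at least as well.
        rivals = sum(1 for j, s in enumerate(sims) if j != i and s >= sims[i])
        disc = 1.0 - rivals / (n - 1) if n > 1 else 1.0
        # Assumption: harmonic mean, so a high score needs both terms to be high.
        total = own + disc
        scores.append(2 * own * disc / total if total > 0 else 0.0)
    return scores


# Toy embeddings: three references pointing in different directions.
refs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
detailed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # each matches its own reference
generic = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # same vague output for every clip

detailed_scores = date_like_scores(detailed, refs)
generic_scores = date_like_scores(generic, refs)
```

In this toy setup the generic output keeps a moderate cosine similarity to the first two references but cannot single out its own reference, so the discriminability term drives its combined score to zero, while the detailed outputs score close to one.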

Lay Summary: While AI models that process audio are getting better, they still struggle to capture the fine details that humans naturally hear. A major reason is that current evaluation methods often fail to distinguish between vague guesses and truly precise descriptions. In this work, we introduce MECAT, a new and highly detailed test suite for audio AI. To evaluate these models accurately, we also designed a new metric, DATE, which actively penalizes generic answers and rewards highly detailed ones. By applying these tools to the latest models, we provide a clearer, more accurate picture of their strengths and weaknesses, paving the way for next-generation AI that comprehends sound at a human level.

Link To Code: https://github.com/xiaomi-research/mecat

Primary Area: General Machine Learning->Evaluation

Keywords: Fine-Grained Audio Understanding, Benchmark, Large Audio-Language Models, Discriminative Evaluation Metric, Audio Caption, Audio Question Answering

Submission Number: 20607
