TL;DR: To effectively distinguish between vague and precise outputs from large audio-language models, we constructed the MECAT benchmark, which comprises a novel, fine-grained annotated dataset and its accompanying discriminative evaluation metric.
Abstract: While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The code and data are publicly available at https://github.com/xiaomi-research/mecat.
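To make the idea behind DATE concrete, the sketch below illustrates one way a single-sample similarity term and a cross-sample discriminability term could be combined. It is not the implementation from the MECAT repository. Several parts are assumptions made only for illustration: the bag-of-words TF-IDF vectors (a stand-in for whatever text representation the metric actually uses), the rank-based discriminability term, the harmonic mean used to combine the two terms, and the hypothetical function names `tfidf_vectors`, `cosine`, and `date_like_scores`.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Turn each text into a sparse TF-IDF vector (dict: word -> weight).

    Assumption: whitespace tokenization and smoothed IDF, so words shared by
    many texts (generic terms) get lower weight than rare, specific words.
    """
    docs = [Counter(t.lower().split()) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in doc)
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    return [{w: count * idf[w] for w, count in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def date_like_scores(candidates, references):
    """Return (similarity, discriminability, combined) for each sample.

    similarity:       candidate i compared with its own reference i.
    discriminability: 1.0 when reference i is the best match for candidate i
                      among all references, falling to 0.0 when it is the worst.
                      Ties are resolved in the candidate's favour.
    combined:         harmonic mean of the two terms (an assumption).
    """
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    n = len(candidates)
    vectors = tfidf_vectors(list(candidates) + list(references))
    cand_vecs, ref_vecs = vectors[:n], vectors[n:]

    results = []
    for i, cand in enumerate(cand_vecs):
        sims = [cosine(cand, ref) for ref in ref_vecs]
        similarity = sims[i]
        rank = 1 + sum(s > similarity for s in sims)
        discriminability = 1.0 - (rank - 1) / (n - 1) if n > 1 else 1.0
        total = similarity + discriminability
        combined = 2 * similarity * discriminability / total if total > 0 else 0.0
        results.append((similarity, discriminability, combined))
    return results


if __name__ == "__main__":
    references = [
        "a dog barks twice while rain falls on a metal roof",
        "a man speaks softly over slow piano music",
        "a car engine starts and then idles",
    ]
    generic = ["a sound", "a sound", "a sound"]
    for name, outputs in [("generic", generic), ("detailed", references)]:
        for sim, disc, score in date_like_scores(outputs, references):
            print(f"{name}: sim={sim:.3f} disc={disc:.3f} combined={score:.3f}")
```

In this sketch, a generic caption overlaps with every reference about equally, so it scores low on similarity and cannot reliably rank its own reference first, whereas a detailed caption that matches its reference scores high on both terms.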
Lay Summary: While AI models that process audio are getting better, they still struggle to capture the fine details that humans naturally hear. A major reason is that current evaluation methods often fail to distinguish between vague guesses and truly precise descriptions. In this work, we introduce MECAT, a new and highly detailed test suite for audio AI. To evaluate these models accurately, we also designed a new metric, DATE, which actively penalizes generic answers and rewards highly detailed ones. By applying these tools to the latest models, we provide a clearer, more accurate picture of their strengths and weaknesses, paving the way for next-generation AI that comprehends sound at a human level.
Link To Code: https://github.com/xiaomi-research/mecat
Primary Area: General Machine Learning->Evaluation
Keywords: Fine-Grained Audio Understanding, Benchmark, Large Audio-Language Models, Discriminative Evaluation Metric, Audio Caption, Audio Question Answering
Submission Number: 20607