AU-Harness: An Open-Source Toolkit for Efficient and Unified Evaluation of AudioLLMs

ACL ARR 2026 January Submission8196 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: large audio language models, evaluation framework
Abstract: Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Existing evaluation frameworks exhibit two critical limitations: (1) a slow and inefficient processing pipeline that bottlenecks large-scale studies, and (2) the absence of a unified, scalable evaluation framework capable of keeping pace with the rapid growth of both LALMs and audio benchmarks. To address these issues, we introduce $\textbf{AU-Harness}$, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 151\% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations that were previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Through evaluation across a diverse set of tasks, we reveal significant gaps in current LALMs. Our findings also highlight a lack of standardization in the modality of user-provided instructions across audio benchmarks, which can lead to performance differences of up to 7.2 absolute points on challenging complex instruction-following downstream tasks. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
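The batch-processing and parallel-execution idea mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not AU-Harness's actual API: `score_sample`, `batched`, `evaluate`, and all parameter names are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def score_sample(sample: str) -> int:
    # Hypothetical stand-in for a model call plus metric computation;
    # here we just use the sample's length as a placeholder score.
    return len(sample)

def batched(items, batch_size):
    # Split the dataset into fixed-size batches.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def evaluate(dataset, batch_size=4, workers=2):
    # Dispatch batches to a thread pool so that (I/O-bound) model calls
    # overlap, mirroring the batching-plus-parallelism idea. map()
    # preserves input order, so results line up with the dataset.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batch_results = pool.map(
            lambda batch: [score_sample(s) for s in batch],
            batched(dataset, batch_size),
        )
        return [score for batch in batch_results for score in batch]
```

In practice, the win from this pattern comes from keeping many inference requests in flight at once rather than scoring samples one at a time; the speedup reported in the paper presumably stems from a more sophisticated version of the same principle.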
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Text-to-Speech and Spoken Language Understanding, Generation, Resources and Evaluation
Contribution Types: NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8196