AU-Harness: An Open-Source Toolkit for Holistic Evaluation of AudioLLMs

17 Sept 2025 (modified: 06 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: large audio language models, evaluation framework
TL;DR: An open-source, efficient, customizable, and holistic evaluation framework for AudioLLMs
Abstract: Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce **AU-Harness**, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations that were previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning. Our findings also highlight a lack of standardization in the modality of user-provided instructions across audio benchmarks, which can lead to performance differences of up to 7.1 absolute points on challenging, complex instruction-following downstream tasks.
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 8173
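
The abstract attributes the reported speedup to optimized batch processing and parallel execution. As a minimal sketch of that general idea — not AU-Harness's actual API; `evaluate_parallel`, `infer_batch`, and all parameters below are hypothetical — batched inference with bounded concurrency looks roughly like this:

```python
# Hypothetical sketch of batched, parallel LALM evaluation.
# This is NOT the AU-Harness implementation, only an illustration of how
# batching plus bounded concurrency removes a per-sample bottleneck.
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def evaluate_parallel(
    samples: Sequence[dict],
    infer_batch: Callable[[List[dict]], Awaitable[List[dict]]],
    batch_size: int = 8,
    max_concurrency: int = 4,
) -> List[dict]:
    """Chunk `samples` into batches and keep up to `max_concurrency` in flight."""
    sem = asyncio.Semaphore(max_concurrency)  # cap concurrent model calls

    async def run_batch(batch: List[dict]) -> List[dict]:
        async with sem:
            return await infer_batch(batch)  # one request scores a whole batch

    batches = [list(samples[i:i + batch_size])
               for i in range(0, len(samples), batch_size)]
    per_batch = await asyncio.gather(*(run_batch(b) for b in batches))
    return [pred for outputs in per_batch for pred in outputs]  # flatten
```

Invoked as `asyncio.run(evaluate_parallel(samples, infer_batch))`, batching amortizes per-request overhead while the semaphore overlaps I/O-bound waits across batches, which is the usual source of such throughput gains; the toolkit's actual scheduler may differ.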