RSA-Bench: Benchmarking Audio Large Models in Real-World Acoustic Scenarios

ACL ARR 2026 January Submission3551 Authors

04 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Audio Large Models, Robustness Benchmarking, Real-World Acoustic Scenarios

Abstract: Audio Large Models (ALLMs) have achieved remarkable proficiency, yet they remain brittle in real-world deployment. Existing evaluations largely rely on synthetic Gaussian noise or simplistic single-source interference, and so fail to capture the intricate, multi-layered acoustic dynamics (the "Acoustic Ecology") that characterize authentic physical environments. To bridge this ecological gap, we introduce **RSA-Bench**, a comprehensive robustness benchmark designed to stress-test ALLMs through high-fidelity auditory scene simulations. Unlike traditional evaluations, we construct samples by naturally superimposing diverse environmental soundscapes, spanning *Pasture*, *Extreme Weather*, *Classroom*, and *Outdoors*, onto clean speech signals across a range of interference intensities. Evaluating models on six core tasks ranging from fundamental perception to complex reasoning, we report three macro-level insights: **(I) The Perception-Cognition Gap:** Models remain relatively resilient in low-level recognition but suffer a **functional collapse** in high-order reasoning tasks under stress; **(II) Scenario Sensitivity:** "Vocal-like" interference (e.g., background laughter) proves significantly more destructive than mechanical noise, challenging the models' auditory attention mechanisms; and **(III) The Denoising Paradox:** Standard speech enhancement often exacerbates performance degradation, as ALLMs prove highly sensitive to the semantic distortions introduced by denoising artifacts.
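The sample construction described in the abstract (an environmental recording overlaid on clean speech at a chosen interference intensity) can be illustrated with a minimal sketch. It assumes that intensity is parameterized as a signal-to-noise ratio in dB, which the abstract does not state; the function name `mix_at_snr` and its parameters are hypothetical and are not taken from the RSA-Bench pipeline.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Overlay `noise` on `speech` so the mixture has the requested SNR in dB.

    Illustrative only: SNR-based scaling is an assumption, not the
    benchmark's documented procedure.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop the noise if it is shorter than the speech, then trim to length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")

    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


# Example: one second of a synthetic tone standing in for clean speech,
# mixed with random noise at three interference levels.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
ambient = rng.standard_normal(8000)
mixtures = {snr: mix_at_snr(clean, ambient, snr) for snr in (20, 10, 0)}
```

Lower SNR values correspond to stronger interference. A real pipeline would also need to handle resampling, clipping, and loudness normalization, which this sketch omits.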

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: benchmarking, robustness, speech technologies

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis

Languages Studied: English

Submission Number: 3551
