Keywords: Large Audio Language Model, Audio Understanding
Abstract: While Large Audio Language Models (LALMs) have demonstrated remarkable capabilities in audio understanding tasks, their performance degrades sharply in complex acoustic scenes, revealing a fundamental limitation in their perceptual grounding. In this work, we first identify a critical failure mode that exposes this limitation: state-of-the-art LALMs paradoxically struggle more with simple evidence-extraction tasks than with complex reasoning ones. We diagnose this as a breakdown in acoustic evidence grounding, a problem rooted in systemic information loss during feature encoding and fusion. To address this, we introduce EvA (Evidence-First Audio), a new paradigm that prioritizes the fidelity of acoustic evidence. EvA's dual-encoder architecture combines Whisper with CED-Base, a ViT-based general audio encoder, and pioneers a structure-preserving, two-stage fusion process. First, it enriches evidence by hierarchically aggregating multi-level features from within the CED-Base encoder. Second, it integrates this representation with Whisper's output via a time-aligned, inject-and-add mechanism that preserves strict temporal alignment between the two streams. To facilitate training for this paradigm, we co-develop EvA-Perception, a large-scale open-source dataset with high-temporal-precision annotations. Our resulting model establishes a new open-source state-of-the-art on multiple challenging benchmarks, including MMAU, MMAR, and MMSU. Crucially, EvA achieves its most significant gains on perception-heavy subsets, validating our hypothesis that addressing the evidence bottleneck is key to unlocking the next level of audio understanding.
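To make the two-stage fusion described in the abstract more concrete, the following PyTorch-style sketch shows one plausible reading of it. The module names (`HierarchicalAggregator`, `InjectAndAddFusion`), the softmax-weighted level mixing, and the interpolation-based time alignment are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAggregator(nn.Module):
    """Stage 1 (sketch): aggregate multi-level CED-Base features into one evidence stream."""
    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        # Learned per-level mixing weights -- an assumption; the abstract only states
        # that multi-level features are aggregated hierarchically within the encoder.
        self.level_weights = nn.Parameter(torch.zeros(num_levels))
        self.proj = nn.Linear(dim, dim)

    def forward(self, level_feats):
        # level_feats: list of (B, T_c, dim) tensors from intermediate CED-Base layers
        w = torch.softmax(self.level_weights, dim=0)
        fused = sum(wi * f for wi, f in zip(w, level_feats))
        return self.proj(fused)

class InjectAndAddFusion(nn.Module):
    """Stage 2 (sketch): time-aligned, inject-and-add fusion into the Whisper feature stream."""
    def __init__(self, ced_dim: int, whisper_dim: int):
        super().__init__()
        # Project the evidence representation into Whisper's feature space before adding.
        self.inject = nn.Linear(ced_dim, whisper_dim)

    def forward(self, whisper_feats, evidence_feats):
        # whisper_feats: (B, T_w, D_w); evidence_feats: (B, T_c, D_c)
        ev = self.inject(evidence_feats)
        # Resample the evidence stream to Whisper's frame rate so the addition is frame-aligned
        # (linear interpolation is an assumed alignment strategy, not necessarily the authors').
        ev = F.interpolate(ev.transpose(1, 2), size=whisper_feats.size(1),
                           mode="linear", align_corners=False).transpose(1, 2)
        # Element-wise addition leaves Whisper's sequence structure untouched,
        # which is one way to realize a structure-preserving fusion.
        return whisper_feats + ev
```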
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24783