Keywords: Large Audio Language Model, Audio Understanding
Abstract: While Large Audio Language Models (LALMs) have demonstrated remarkable capabilities in audio understanding tasks, their performance degrades sharply in complex acoustic scenes, revealing a fundamental limitation in their perceptual grounding. In this work, we first identify a critical failure mode that exposes this limitation: state-of-the-art LALMs paradoxically struggle more with simple evidence-extraction tasks than with complex reasoning ones. We diagnose this as a breakdown in acoustic evidence grounding, a problem rooted in systemic information loss during feature encoding and fusion. To address this, we introduce EvA (Evidence-First Audio), a new paradigm that prioritizes the fidelity of acoustic evidence. EvA's dual-encoder architecture combines Whisper with CED-Base, a ViT-based general audio encoder, and pioneers a structure-preserving, two-stage fusion process. First, it enriches evidence by hierarchically aggregating multi-level features from within the CED-Base encoder. Second, it integrates this representation with Whisper's output via a time-aligned, inject-and-add mechanism that preserves strict temporal alignment between the two streams. To facilitate training for this paradigm, we co-develop EvA-Perception, a large-scale open-source dataset with high-temporal-precision annotations. Our resulting model establishes a new open-source state-of-the-art on multiple challenging benchmarks, including MMAU, MMAR, and MMSU. Crucially, EvA achieves its most significant gains on perception-heavy subsets, validating our hypothesis that addressing the evidence bottleneck is key to unlocking the next level of audio understanding.
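To make the two-stage fusion described in the abstract more concrete, the following PyTorch-style sketch shows one plausible reading of it. The module names (`HierarchicalAggregator`, `InjectAndAddFusion`), the softmax-weighted level mixing, and the interpolation-based time alignment are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAggregator(nn.Module):
    """Stage 1 (sketch): aggregate multi-level CED-Base features into one evidence stream."""
    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        # Learned per-level mixing weights -- an assumption; the abstract only states
        # that multi-level features are aggregated hierarchically within the encoder.
        self.level_weights = nn.Parameter(torch.zeros(num_levels))
        self.proj = nn.Linear(dim, dim)

    def forward(self, level_feats):
        # level_feats: list of (B, T_c, dim) tensors from intermediate CED-Base layers
        w = torch.softmax(self.level_weights, dim=0)
        fused = sum(wi * f for wi, f in zip(w, level_feats))
        return self.proj(fused)

class InjectAndAddFusion(nn.Module):
    """Stage 2 (sketch): time-aligned, inject-and-add fusion into the Whisper feature stream."""
    def __init__(self, ced_dim: int, whisper_dim: int):
        super().__init__()
        # Project the evidence representation into Whisper's feature space before adding.
        self.inject = nn.Linear(ced_dim, whisper_dim)

    def forward(self, whisper_feats, evidence_feats):
        # whisper_feats: (B, T_w, D_w); evidence_feats: (B, T_c, D_c)
        ev = self.inject(evidence_feats)
        # Resample the evidence stream to Whisper's frame rate so the addition is frame-aligned
        # (linear interpolation is an assumed alignment strategy, not necessarily the authors').
        ev = F.interpolate(ev.transpose(1, 2), size=whisper_feats.size(1),
                           mode="linear", align_corners=False).transpose(1, 2)
        # Element-wise addition leaves Whisper's sequence structure untouched,
        # which is one way to realize a structure-preserving fusion.
        return whisper_feats + ev
```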
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24783