Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding

ACL ARR 2026 January Submission 7925 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Large Audio Language Models, Audio Question Answering, Auditory Scene Analysis, Perceptual Error, Latent Reasoning, Group Relative Policy Optimization, Reinforcement Learning
Abstract: Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they frequently suffer from **perceptual errors**, and reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Inspired by **Auditory Scene Analysis**, we first introduce **PAQA**, a dataset for Perception-Aware Question Answering. PAQA implements a hierarchical decoupling strategy that separates speech from environmental sound and distinguishes multiple speakers, providing explicit perceptual reasoning for training. Building on this, we propose **HyPeR**, a two-stage Hybrid Perception-Reasoning framework. In Stage I, we finetune the model on PAQA to perceive acoustic attributes in complex audio. In Stage II, we leverage Group Relative Policy Optimization (GRPO) to refine the model's internal deliberation. We introduce PAUSE tokens to facilitate latent computation during acoustically ambiguous phases and design a **Perceptual Consistency Reward** to align reasoning rationales with the raw audio. Experiments across key benchmarks demonstrate that HyPeR achieves absolute improvements over the base model on MMAU-mini (+13.1%), MMAR (+25.5%), and PAQA (+28.2%), with performance comparable to large-scale models, underscoring the effectiveness of hybrid perception-grounded reasoning, particularly in noisy and multi-speaker scenarios.
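The Stage II description pairs GRPO with a perceptual-consistency reward. A minimal sketch of the group-relative advantage computation is given below; the reward combination, the weighting `alpha`, and the helper names (`total_reward`, the correctness/consistency inputs) are illustrative assumptions, not the authors' implementation.

```python
# Sketch: GRPO-style group-relative advantages with a combined
# accuracy + perceptual-consistency reward (assumed formulation).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO normalizes each reward against its sampled group:
    A_i = (r_i - mean(r)) / (std(r) + eps); no value network needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def total_reward(answer_correct, consistency, alpha=0.5):
    """Combine task accuracy with a perceptual-consistency score in
    [0, 1]; alpha is an assumed weighting, not from the paper."""
    return float(answer_correct) + alpha * consistency

# Example: 4 sampled rollouts for one audio question, each with a
# correctness flag and an (assumed) consistency score vs. the audio.
rollouts = [(True, 0.9), (True, 0.4), (False, 0.7), (False, 0.1)]
rewards = [total_reward(c, s) for c, s in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that are both correct and consistent with the audio receive the largest positive advantage, so the policy update favors rationales grounded in what was actually heard rather than merely correct answers.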
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Spoken Language Understanding, Machine Learning for NLP, NLP Applications, Question Answering
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Chinese
Submission Number: 7925