Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding

ACL ARR 2026 January Submission 7925 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Large Audio Language Models, Audio Question Answering, Auditory Scene Analysis, Perceptual Error, Latent Reasoning, Group Relative Policy Optimization, Reinforcement Learning
Abstract: Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they frequently suffer from **perceptual errors**, and reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Inspired by **Auditory Scene Analysis**, we first introduce **PAQA**, a dataset for Perception-Aware Question Answering. PAQA implements a hierarchical decoupling strategy that separates speech from environmental sound and distinguishes multiple speakers, providing explicit perceptual reasoning for training. Building on this, we propose **HyPeR**, a two-stage Hybrid Perception-Reasoning framework. In Stage I, we finetune the model on PAQA to perceive acoustic attributes in complex audio. In Stage II, we leverage Group Relative Policy Optimization (GRPO) to refine the model's internal deliberation. We introduce PAUSE tokens to facilitate latent computation during acoustically ambiguous phases and design a **Perceptual Consistency Reward** to align reasoning rationales with the raw audio. Experiments across key benchmarks demonstrate that HyPeR achieves absolute improvements over the base model on MMAU-mini (+13.1%), MMAR (+25.5%), and PAQA (+28.2%), with performance comparable to large-scale models, underscoring the effectiveness of hybrid perception-grounded reasoning, particularly in noisy and multi-speaker scenarios.
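The Stage II description pairs GRPO with a perceptual-consistency reward. A minimal sketch of the group-relative advantage computation is given below; the reward combination, the weighting `alpha`, and the helper names (`total_reward`, the correctness/consistency inputs) are illustrative assumptions, not the authors' implementation.

```python
# Sketch: GRPO-style group-relative advantages with a combined
# accuracy + perceptual-consistency reward (assumed formulation).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO normalizes each reward against its sampled group:
    A_i = (r_i - mean(r)) / (std(r) + eps); no value network needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def total_reward(answer_correct, consistency, alpha=0.5):
    """Combine task accuracy with a perceptual-consistency score in
    [0, 1]; alpha is an assumed weighting, not from the paper."""
    return float(answer_correct) + alpha * consistency

# Example: 4 sampled rollouts for one audio question, each with a
# correctness flag and an (assumed) consistency score vs. the audio.
rollouts = [(True, 0.9), (True, 0.4), (False, 0.7), (False, 0.1)]
rewards = [total_reward(c, s) for c, s in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that are both correct and consistent with the audio receive the largest positive advantage, so the policy update favors rationales grounded in what was actually heard rather than merely correct answers.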
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Spoken Language Understanding, Machine Learning for NLP, NLP Applications, Question Answering
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Chinese
Submission Number: 7925