Long Listening Thoughts: Eliciting Open Auditory Reasoning with Deliberative Perception and Cognitive Refinement

Jaeyeon Kim; Chao-Han Huck Yang; Luoyi Zhang; Chan-Jan Hsu; Fernando Ruiloba Portilla; Jinchuan Tian; Shinji Watanabe; Carlos Busso

Published: 28 Apr 2026, Last Modified: 28 Apr 2026, Venue: MSLD 2026 Poster, Readers: Everyone, License: CC BY 4.0

Keywords: Audio Reasoning, Large Audio-Language Models

TL;DR: We propose a dataset for audio reasoning post-training that encourages deliberative perception and refines cognitive behaviors.

Abstract: We present Long Listening Thoughts (LLT), a framework designed to scale open-domain audio reasoning by encouraging deliberative perception and cognitive refinement. To address persistent perceptual errors in Large Audio-Language Models (LALMs), LLT constructs 1.7M reasoning traces grounded in multi-aspect dense descriptions capturing semantic events, acoustic properties, and temporal structure. Through a thought continuation strategy, the framework incorporates System-2 style cognitive processes, such as verification, backtracking, and explicit "re-listening", that iteratively correct perceptual and reasoning mistakes. We fine-tune Qwen2.5-Omni on LLT and observe significant improvements on challenging benchmarks including MMAR and MMSU. Our analysis demonstrates that multi-aspect perceptual grounding enhances reasoning quality, that greater question complexity facilitates more effective SFT scaling, and that incorporating incorrect reasoning traces plays a crucial role in learning cognitive behaviors.

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 136
