Long Listening Thoughts: Eliciting Open Auditory Reasoning with Deliberative Perception and Cognitive Refinement
Keywords: Audio Reasoning, Large Audio-Language Models
TL;DR: We propose a dataset for audio reasoning post-training that encourages deliberative perception and the refinement of cognitive behaviors.
Abstract: We present Long Listening Thoughts (LLT), a framework designed to scale open-domain audio reasoning by encouraging deliberative perception and cognitive refinement. To address persistent perceptual errors in Large Audio-Language Models (LALMs), LLT constructs 1.7M reasoning traces grounded in multi-aspect dense descriptions capturing semantic events, acoustic properties, and temporal structure. Through a thought continuation strategy, the framework incorporates System-2-style cognitive processes, such as verification, backtracking, and explicit “re-listening”, that iteratively correct perceptual and reasoning mistakes. We fine-tune Qwen2.5-Omni on LLT and observe significant improvements on challenging benchmarks including MMAR and MMSU. Our analysis demonstrates that multi-aspect perceptual grounding enhances reasoning quality, that greater question complexity facilitates more effective SFT scaling, and that incorporating incorrect reasoning traces plays a crucial role in learning cognitive behaviors.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 136