The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

ICLR 2026 Conference Submission 16845 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Spatial Audio, Large Audio-Language Models, Acoustic Scene Analysis, Mixture-of-Experts, Reinforcement Learning
TL;DR: We enable spatial ("where") understanding in large audio-language models via a synthesized binaural dataset, a Mixture-of-Experts architecture, and a progressive SFT-to-GRPO training curriculum.
Abstract: Existing large audio-language models perceive the world as "mono": a single stream of audio that ignores the critical spatial dimension ("where") required for universal acoustic scene analysis. To overcome this fundamental limitation, we introduce a framework that enables models such as Qwen2-Audio to understand and reason about the complex, three-dimensional acoustic world. Our framework rests on three core contributions. First, we build a large-scale, synthesized binaural audio dataset that provides rich spatial cues. Second, we design a novel Mixture-of-Experts (MoE) architecture in which a learnable router directs the outputs of parallel semantic and spatial encoders to specialized expert pathways. Finally, we employ a progressive training curriculum, advancing from supervised fine-tuning (SFT) to reinforcement learning via Group Relative Policy Optimization (GRPO), to evolve the model's capabilities from basic perception to advanced reasoning. On our comprehensive benchmark, the model demonstrates strong spatial understanding. By enabling this spatial perception, our work provides a clear pathway for leveraging the powerful reasoning abilities of large models toward holistic acoustic scene analysis, advancing from one-dimensional semantic recognition to three-dimensional spatial intelligence.
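The routing mechanism described in the abstract can be illustrated with a minimal PyTorch sketch: a learnable router weights the fused outputs of parallel semantic and spatial encoders and dispatches them to expert pathways. All module names, dimensions, and the soft (all-experts) gating rule here are illustrative assumptions, not the authors' implementation, which is not specified on this page.

```python
import torch
import torch.nn as nn


class SpatialSemanticMoE(nn.Module):
    """Sketch of a learnable router over parallel semantic/spatial features."""

    def __init__(self, dim: int = 768, num_experts: int = 4):
        super().__init__()
        # Stand-ins for the paper's parallel encoders (e.g., a Qwen2-Audio
        # semantic encoder and a binaural spatial-cue encoder); here they are
        # simple projections over precomputed features.
        self.semantic_proj = nn.Linear(dim, dim)
        self.spatial_proj = nn.Linear(dim, dim)
        # Learnable router producing one gate per expert from the fused features.
        self.router = nn.Linear(2 * dim, num_experts)
        # Specialized expert pathways, modeled as small MLPs.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, semantic_feats: torch.Tensor,
                spatial_feats: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, time, dim). Concatenate the two encoder streams.
        fused = torch.cat(
            [self.semantic_proj(semantic_feats), self.spatial_proj(spatial_feats)],
            dim=-1,
        )
        # Router gates: (batch, time, num_experts), summing to 1 per token.
        gates = torch.softmax(self.router(fused), dim=-1)
        # Soft mixture: weight every expert's output by its gate value.
        out = sum(
            gates[..., e:e + 1] * expert(fused)
            for e, expert in enumerate(self.experts)
        )
        return out  # spatially aware audio tokens for the language model


# Example: route two feature streams for a 2-clip batch of 50 frames each.
moe = SpatialSemanticMoE()
sem = torch.randn(2, 50, 768)
spa = torch.randn(2, 50, 768)
tokens = moe(sem, spa)  # -> shape (2, 50, 768)
```

A production MoE would typically use sparse top-k routing with a load-balancing loss rather than the dense softmax mixture above; the dense form is kept here only to make the router-to-experts data flow explicit.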
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16845