Keywords: Chain-of-Thought, Spatial Reasoning
Abstract: Spatial reasoning is fundamental to auditory perception, yet current audio large
language models (ALLMs) largely rely on unstructured binaural cues and single-
step inference. This limits both perceptual accuracy in direction and distance
estimation and the capacity for interpretable reasoning. Recent work such as BAT
demonstrates spatial QA with binaural audio, but its reliance on coarse categorical
labels (left, right, up, down) and the absence of explicit geometric supervision
constrain resolution and robustness. We introduce the $\textbf{Spatial-Acoustic Geometry Encoder (SAGE)}$, a geometry-aware audio encoder that aligns binaural acoustic
features with 3D spatial structure using panoramic depth images and room-impulse
responses at training time, while requiring only audio at inference. Building on this
representation, we present $\textbf{OWL}$, an ALLM that integrates $\textbf{SAGE}$ with a spatially
grounded chain-of-thought to reason over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, $\textbf{OWL}$ supports o’clock-level azimuth (DoA) estimation. To enable large-scale training and evaluation, we construct and release $\textbf{BiDepth}$,
a dataset of over one million QA pairs combining binaural audio with panoramic
depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new $\textbf{BiDepth}$ and the public SpatialSoundQA, $\textbf{OWL}$ reduces mean DoA error by $\textbf{11}^{\circ}$ through $\textbf{SAGE}$
and improves spatial reasoning QA accuracy by up to $\textbf{25}$% over BAT. Our dataset and code are available at: https://anonymous.4open.science/r/OWL-ICLR-26/
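The abstract states that SAGE aligns binaural acoustic features with 3D spatial structure (panoramic depth images and room impulse responses) only at training time, so that inference needs audio alone. The sketch below is a minimal, hypothetical illustration of one way such a training-time alignment could be set up, using a symmetric contrastive loss between an audio branch and a geometry branch; the module names, feature dimensions, and loss choice are assumptions for illustration, not the paper's released implementation.

```python
# Hypothetical sketch (not the authors' code): a contrastive alignment between
# binaural-audio embeddings and depth/RIR embeddings, trained jointly so that
# the audio branch alone carries geometry-aware structure at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioBranch(nn.Module):
    """Maps binaural audio features to a shared geometry-aware embedding space."""
    def __init__(self, in_dim: int = 128, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


class GeometryBranch(nn.Module):
    """Maps panoramic-depth + room-impulse-response features to the same space (training only)."""
    def __init__(self, in_dim: int = 64, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def alignment_loss(audio_emb: torch.Tensor, geo_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched audio/geometry pairs attract, mismatched pairs repel."""
    logits = audio_emb @ geo_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    audio_branch, geo_branch = AudioBranch(), GeometryBranch()
    binaural_feats = torch.randn(8, 128)   # stand-in for binaural audio features
    geometry_feats = torch.randn(8, 64)    # stand-in for depth/RIR features
    loss = alignment_loss(audio_branch(binaural_feats), geo_branch(geometry_feats))
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")  # at inference, only audio_branch would be used
```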
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3514