Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization
Keywords: Sound Source Localization, Geometry-Invariant, Grid-Flexible, Representation Learning, Physics-Informed Design, Learnable Non-uniform DFT, Relative Microphone Positional Encoding
TL;DR: This paper proposes audio-geometry-grid representation learning for grid-flexible and geometry-invariant sound source localization, leveraging learnable non-uniform discrete Fourier transform and relative microphone positional encoding.
Abstract: Sound source localization (SSL) is a fundamental task for spatial audio understanding, yet most deep neural network-based methods are constrained by fixed array geometries and predefined directional grids, limiting their generalizability and scalability. To address these issues, we propose _audio-geometry-grid representation learning_ (AGG-RL), a novel framework that jointly learns audio-geometry and grid representations in a shared latent space, enabling SSL that is both geometry-invariant and grid-flexible. Moreover, to enhance generalizability and interpretability, we introduce two physics-informed components: a _learnable non-uniform discrete Fourier transform_ (LNuDFT), which non-uniformly allocates frequency bins so that informative phase regions are sampled more densely, and a _relative microphone positional encoding_ (rMPE), which encodes relative microphone coordinates in accordance with the nature of inter-channel time differences. Experiments on synthetic and real datasets demonstrate that AGG-RL achieves superior performance, particularly under unseen conditions. The results highlight the potential of representation learning with physics-informed design toward a universal solution for spatial acoustic scene understanding across diverse scenarios.
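To make the LNuDFT idea concrete, the sketch below evaluates a non-uniform DFT at an arbitrary set of normalized frequencies. This is an illustrative NumPy implementation under our own assumptions, not the paper's code: the frequency grid here is hand-picked and fixed, whereas in LNuDFT the bin positions would be learnable parameters optimized end-to-end.

```python
import numpy as np

def nudft(signal, freqs):
    """Non-uniform DFT: evaluate the DTFT of `signal` at arbitrary
    normalized frequencies `freqs` (cycles/sample).

    In the paper's LNuDFT these frequencies would be learnable
    parameters; here they are fixed for illustration."""
    n = np.arange(len(signal))
    # (K, N) complex-exponential analysis basis at the chosen frequencies
    basis = np.exp(-2j * np.pi * np.outer(freqs, n))
    return basis @ signal

# A hand-picked non-uniform grid (an assumption for this sketch):
# dense bins at low frequencies, sparse bins at high frequencies.
freqs = np.concatenate([
    np.linspace(0.0, 0.1, 24, endpoint=False),  # dense low-frequency region
    np.linspace(0.1, 0.5, 8, endpoint=False),   # sparse high-frequency region
])

# Test tone at 0.05 cycles/sample; its spectral peak should land
# on the dense part of the grid.
x = np.sin(2 * np.pi * 0.05 * np.arange(64))
spec = nudft(x, freqs)
peak = freqs[np.argmax(np.abs(spec))]  # ~0.05
```

In a trainable version, `freqs` would be a parameter tensor (e.g. a `torch.nn.Parameter`) so that gradient descent can shift bins toward the phase regions most useful for localization.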
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10505