Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization
Keywords: Sound Source Localization, Geometry-Invariant, Grid-Flexible, Representation Learning, Physics-Informed Design, Learnable Non-uniform DFT, Relative Microphone Positional Encoding
TL;DR: This paper proposes audio-geometry-grid representation learning for grid-flexible and geometry-invariant sound source localization, leveraging learnable non-uniform discrete Fourier transform and relative microphone positional encoding.
Abstract: Sound source localization (SSL) is a fundamental task for spatial audio understanding, yet most deep neural network-based methods are constrained by fixed array geometries and predefined directional grids, limiting their generalizability and scalability. To address these issues, we propose _audio-geometry-grid representation learning_ (AGG-RL), a novel framework that jointly learns audio-geometry and grid representations in a shared latent space, enabling SSL that is both geometry-invariant and grid-flexible. Moreover, to enhance generalizability and interpretability, we introduce two physics-informed components: a _learnable non-uniform discrete Fourier transform_ (LNuDFT), which non-uniformly allocates frequency bins so that informative phase regions are sampled more densely, and a _relative microphone positional encoding_ (rMPE), which encodes relative microphone coordinates in accordance with the nature of inter-channel time differences. Experiments on synthetic and real datasets demonstrate that AGG-RL achieves superior performance, particularly under unseen conditions. The results highlight the potential of representation learning with physics-informed design toward a universal solution for spatial acoustic scene understanding across diverse scenarios.
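To make the LNuDFT idea concrete, the sketch below evaluates a non-uniform DFT at an arbitrary set of normalized frequencies. This is an illustrative NumPy implementation under our own assumptions, not the paper's code: the frequency grid here is hand-picked and fixed, whereas in LNuDFT the bin positions would be learnable parameters optimized end-to-end.

```python
import numpy as np

def nudft(signal, freqs):
    """Non-uniform DFT: evaluate the DTFT of `signal` at arbitrary
    normalized frequencies `freqs` (cycles/sample).

    In the paper's LNuDFT these frequencies would be learnable
    parameters; here they are fixed for illustration."""
    n = np.arange(len(signal))
    # (K, N) complex-exponential analysis basis at the chosen frequencies
    basis = np.exp(-2j * np.pi * np.outer(freqs, n))
    return basis @ signal

# A hand-picked non-uniform grid (an assumption for this sketch):
# dense bins at low frequencies, sparse bins at high frequencies.
freqs = np.concatenate([
    np.linspace(0.0, 0.1, 24, endpoint=False),  # dense low-frequency region
    np.linspace(0.1, 0.5, 8, endpoint=False),   # sparse high-frequency region
])

# Test tone at 0.05 cycles/sample; its spectral peak should land
# on the dense part of the grid.
x = np.sin(2 * np.pi * 0.05 * np.arange(64))
spec = nudft(x, freqs)
peak = freqs[np.argmax(np.abs(spec))]  # ~0.05
```

In a trainable version, `freqs` would be a parameter tensor (e.g. a `torch.nn.Parameter`) so that gradient descent can shift bins toward the phase regions most useful for localization.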
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10505