DNN-Based Geometry-Invariant DOA Estimation With Microphone Positional Encoding and Complexity Gradual Training

Min-Sang Baek, Joon-Hyuk Chang, Israel Cohen

Published: 01 Jan 2025, Last Modified: 02 Dec 2025. Venue: IEEE Transactions on Audio, Speech and Language Processing. License: CC BY-SA 4.0

Abstract: Recent deep neural network (DNN)-based direction-of-arrival (DOA) estimation methods demonstrate greater robustness compared to conventional methods. However, most DNNs are designed for specific microphone arrays, requiring retraining for different geometries. Although some geometry-invariant methods employ conventional features, they often incur high computational costs and are prone to interference. This paper proposes a geometry-invariant DOA estimation network (GI-DOAEnet). It employs microphone positional encodings (MPEs) that modulate microphone spherical coordinates using sinusoidal functions to provide unique geometric information. Combining MPEs and channel-wise latent features, the network captures spatio-temporal correlations through geometry-invariant modules, ultimately producing spatial spectra. To train GI-DOAEnet effectively with diverse geometries, a complexity gradual training strategy is introduced, integrating deeply supervised curriculum learning with a novel multi-stage geometry learning method. This gradually increases task difficulty by training through varying soft labels and staged transitions from fixed to dynamic geometries. GI-DOAEnet achieves superior performance over baselines in terms of degree error and accuracy across diverse acoustic environments, while reducing FLOPS and inference time by eliminating pair-wise features and employing channel-wise aggregation.
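
Illustrative sketch (not from the paper): the abstract describes MPEs as sinusoidal modulations of microphone spherical coordinates, but it does not give the formula. The Python example below shows one plausible form of such an encoding, borrowed from generic sinusoidal positional encodings. The function names, the power-of-two frequency schedule, the `num_freqs` parameter, and the azimuth/elevation convention are all assumptions made for illustration, and the actual GI-DOAEnet encoding may differ.

```python
import numpy as np

def cartesian_to_spherical(xyz):
    """Convert (M, 3) Cartesian positions to (radius, azimuth, elevation).

    Convention assumed here: azimuth in the xy-plane, elevation measured
    from the xy-plane toward +z.
    """
    xyz = np.asarray(xyz, dtype=float)
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)
    # Avoid division by zero for a microphone placed at the origin.
    ratio = np.divide(z, r, out=np.zeros_like(r), where=r > 0)
    elevation = np.arcsin(np.clip(ratio, -1.0, 1.0))
    return np.stack([r, azimuth, elevation], axis=1)

def microphone_positional_encoding(xyz, num_freqs=4):
    """Hypothetical sinusoidal encoding of microphone positions.

    Each spherical coordinate is scaled by a set of frequencies
    (assumed here to be powers of two) and passed through sin and cos.
    Returns an array of shape (M, 3 * 2 * num_freqs).
    """
    sph = cartesian_to_spherical(xyz)                  # (M, 3)
    freqs = 2.0 ** np.arange(num_freqs)                # (F,)
    angles = sph[:, :, None] * freqs[None, None, :]    # (M, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(sph.shape[0], -1)               # (M, 6F)

# Example: a 4-microphone circular array with a 5 cm radius.
mics = np.array([
    [0.05, 0.0, 0.0],
    [0.0, 0.05, 0.0],
    [-0.05, 0.0, 0.0],
    [0.0, -0.05, 0.0],
])
mpe = microphone_positional_encoding(mics, num_freqs=4)
```

With four microphones and `num_freqs=4`, `mpe` has shape `(4, 24)`, one encoding vector per channel. Because the encoding is computed independently for each microphone, the same function applies to arrays with any number of channels, which is the property a geometry-invariant model would need from its geometric input.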

External IDs: doi:10.1109/taslpro.2025.3577336