Multi-speaker DoA Estimation Using Audio and Visual Modality

Published: 2023, Last Modified: 08 Aug 2024 | Neural Process. Lett. 2023 | CC BY-SA 4.0
Abstract: Traditionally, direction of arrival (DoA) estimation approaches use only the audio modality. However, humans naturally locate sound sources through both auditory and visual cues. Motivated by this, we adopt audio and visual modalities for DoA estimation, where the video serves as a prominent supplementary modality for sound source localization. Additionally, this paper introduces a novel transformer-based sound source localization framework, in which self-attention mechanisms capture temporal dependencies in the multi-channel audio signals. The whole model is trained to predict an ideal spatial spectrum using likelihood-based output coding. The framework is evaluated on an available multi-speaker sound source localization dataset and compared against state-of-the-art methods in terms of DoA estimation error and localization accuracy. Experimental results show that the proposed audio-visual multi-speaker DoA estimation method yields improved performance over the baselines.
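As an illustration of the likelihood-based output coding mentioned in the abstract, the following is a minimal sketch of one common formulation, not the paper's implementation. It assumes a 360-bin azimuth grid, a Gaussian-like bump of width `sigma` around each true DoA, and a simple thresholded peak search for decoding; the function names, `sigma=8.0`, and `threshold=0.5` are all hypothetical choices made for this example.

```python
import numpy as np


def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    diff = np.abs(a_deg - b_deg) % 360.0
    return np.minimum(diff, 360.0 - diff)


def encode_spatial_spectrum(doas_deg, n_bins=360, sigma=8.0):
    """Build a target spectrum: one bump per source, merged by max.

    Assumption: each bin holds exp(-d^2 / sigma^2), where d is the
    angular distance to the nearest true DoA. An empty source list
    yields an all-zero spectrum.
    """
    grid = np.arange(n_bins) * (360.0 / n_bins)
    spectrum = np.zeros(n_bins)
    for doa in doas_deg:
        bump = np.exp(-angular_distance(grid, doa) ** 2 / sigma ** 2)
        spectrum = np.maximum(spectrum, bump)
    return spectrum


def decode_peaks(spectrum, threshold=0.5):
    """Return azimuths of circular local maxima above the threshold."""
    n_bins = len(spectrum)
    left = np.roll(spectrum, 1)    # value of the previous bin (wraps around)
    right = np.roll(spectrum, -1)  # value of the next bin (wraps around)
    is_peak = (spectrum > threshold) & (spectrum >= left) & (spectrum > right)
    return np.flatnonzero(is_peak) * (360.0 / n_bins)


if __name__ == "__main__":
    target = encode_spatial_spectrum([30.0, 200.0])
    print(decode_peaks(target))
```

In this sketch a network would be trained to regress `target` from its input features, and `decode_peaks` would be applied to the predicted spectrum at test time, so the number of speakers does not have to be fixed in advance.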