Blind Speech Separation Through Direction of Arrival Estimation Using Deep Neural Networks with a Flexibility on the Number of Speakers

Abstract: This paper presents a complete framework for blind multi-speaker separation in reverberant environments using neural networks, while remaining flexible with respect to the number of sound sources. Under the W-disjoint orthogonality (WDO) hypothesis for speech signals, the proposed framework first estimates the direction of arrival (DoA) of the dominant speech signal in each time-frequency bin using a deep neural network. Whereas existing state-of-the-art methods use the masks associated with the different DoAs directly for separation, we propose to use them as input features to a second deep neural network that estimates refined separation masks. Each speaker's signal is then separated using an estimated Generalized Eigenvalue (GEV) Wiener filter. This approach reduces distortion, interference, and artifacts. We assess the framework through numerical experiments on a simulated dataset, with a comprehensive analysis, and on a real dataset built from recorded spatial room impulse responses, to check whether the framework generalizes to real conditions. The experiments show that our contribution outperforms the state of the art by almost 3 dB to 6 dB in terms of signal-to-distortion ratio (SDR), and improves the word error rate by 22% to 26% when using the wav2vec2 ASR model.
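As an illustration of the WDO assumption only, the following minimal sketch shows the baseline idea of turning per-bin DoA estimates directly into binary separation masks. It is not the proposed framework: the DoA estimates are taken as given rather than predicted by a network, and neither the mask refinement network nor the GEV Wiener filter is included. The function names (`circular_distance_deg`, `doa_binary_masks`, `apply_masks`) and the use of azimuth angles in degrees are assumptions made for this example.

```python
import numpy as np

def circular_distance_deg(a, b):
    """Smallest absolute angular difference between a and b, in degrees."""
    return np.abs((a - b + 180.0) % 360.0 - 180.0)

def doa_binary_masks(doa_per_bin, speaker_doas):
    """Assign each time-frequency bin to the speaker with the nearest DoA.

    doa_per_bin  : (F, T) array of per-bin DoA estimates, in degrees (assumed given)
    speaker_doas : (K,) array with one DoA per speaker, in degrees
    returns      : (K, F, T) array of 0/1 masks that sum to 1 in every bin
    """
    doa_per_bin = np.asarray(doa_per_bin, dtype=float)
    speaker_doas = np.asarray(speaker_doas, dtype=float)
    # dist[k, f, t] = angular distance between bin (f, t) and speaker k
    dist = circular_distance_deg(doa_per_bin[None, :, :],
                                 speaker_doas[:, None, None])
    # Under WDO, each bin is assumed to be dominated by a single speaker
    owner = np.argmin(dist, axis=0)  # shape (F, T)
    speaker_index = np.arange(len(speaker_doas))[:, None, None]
    return (owner[None, :, :] == speaker_index).astype(float)

def apply_masks(mixture_stft, masks):
    """Apply (K, F, T) masks to an (F, T) mixture STFT, one output per speaker."""
    return masks * np.asarray(mixture_stft)[None, :, :]

# Toy example: two speakers at 0 and 180 degrees, a 2 x 2 time-frequency grid
doa_per_bin = np.array([[10.0, 350.0],
                        [170.0, 185.0]])
masks = doa_binary_masks(doa_per_bin, [0.0, 180.0])
rng = np.random.default_rng(0)
mixture = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
separated = apply_masks(mixture, masks)
```

Because the masks are binary and partition the time-frequency plane, the separated spectrograms sum back to the mixture. Hard masks of this kind are the starting point that the abstract describes refining with a second network.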