Multi-Channel Speech Enhancement Guided by Learning-Based $A$ $Posteriori$ Speech Presence Probability Estimation

Shuai Tao, Pejman Mowlaee, Jesper Rindom Jensen, Mads Græsbøll Christensen

Published: 01 Jan 2025, Last Modified: 06 Apr 2026. Venue: IEEE Transactions on Audio, Speech and Language Processing. License: CC BY-SA 4.0

Abstract: In this paper, a new deep neural network (DNN)-guided multi-channel speech enhancement approach is proposed to achieve noise reduction, dereverberation, and speech restoration. Unlike end-to-end methods, a DNN model is employed to estimate the key parameters that guide the beamformer (BF) and post-filter (PF). Since the $a$ $posteriori$ speech presence probability (SPP) provides a soft decision on whether speech is present or absent in the short-time Fourier transform domain, the SPP estimate is derived by the DNN to guide the statistics estimation in both the BF and PF stages. In the first stage, the multi-channel SPP (MC-SPP) estimate is used to update the estimates of the power spectral density (PSD) matrices of the noise and clean signals. The steering vector is obtained using the covariance subtraction method to implement the BF. In the second stage, the output of the first stage and the observed signal of the reference microphone are combined as the input of the second DNN to estimate the single-channel SPP of the observed signal from the reference microphone. With the single-channel SPP estimate, the noise PSD is updated using the minimum mean-squared error method without time-frame smoothing. Finally, with the estimated statistics, a commonly used BF and PF are employed to extract the target speech from the observed signal. In this paper, the BF and PF are the minimum variance distortionless response (MVDR) beamformer and the log spectral amplitude estimator, respectively. A single small DNN model for multi-channel speech enhancement is used to estimate the SPP in both stages, which aims to improve the model's adoption and effectiveness. The experimental results demonstrate that, compared to end-to-end approaches, the proposed method achieves better performance in terms of noise attenuation and speech distortion while maintaining a lower model complexity, measured in terms of both the number of parameters and multiply-accumulate (MAC) operations per second.
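The first stage described in the abstract (SPP-guided PSD matrix updates, a steering vector from covariance subtraction, and an MVDR beamformer) can be illustrated with a minimal NumPy sketch for a single frequency bin. This is not the authors' implementation. The abstract does not give the update equations, so the SPP-weighted recursive smoothing used here is a commonly used formulation adopted as an assumption. The function names, the smoothing constant `alpha`, and the diagonal `loading` term are all hypothetical, and the SPP is taken as a given input rather than produced by a DNN.

```python
import numpy as np


def update_psd_matrices(Y, spp, alpha=0.9):
    """Recursively estimate noisy and noise PSD matrices for one frequency bin.

    Y     : (T, M) complex STFT coefficients, T frames and M microphones.
    spp   : (T,) speech presence probability per frame, values in [0, 1].
    alpha : smoothing constant (assumed value, not taken from the paper).
    """
    T, M = Y.shape
    phi_y = np.zeros((M, M), dtype=complex)
    phi_n = np.zeros((M, M), dtype=complex)
    for t in range(T):
        outer = np.outer(Y[t], Y[t].conj())  # y y^H for the current frame
        phi_y = alpha * phi_y + (1.0 - alpha) * outer
        # Assumed rule: the noise estimate is frozen when the SPP is 1 and
        # follows the observation with the usual smoothing when the SPP is 0.
        a_n = alpha + (1.0 - alpha) * spp[t]
        phi_n = a_n * phi_n + (1.0 - a_n) * outer
    return phi_y, phi_n


def mvdr_weights(phi_y, phi_n, ref=0, loading=1e-6):
    """MVDR weights with a steering vector from covariance subtraction."""
    M = phi_n.shape[0]
    # Covariance subtraction: speech PSD matrix = noisy PSD - noise PSD.
    phi_x = phi_y - phi_n
    # Relative steering vector: reference column, normalised at the reference mic.
    d = phi_x[:, ref] / phi_x[ref, ref]
    # Small diagonal loading (hypothetical choice) keeps the solve well conditioned.
    phi_n_loaded = phi_n + loading * (np.trace(phi_n).real / M) * np.eye(M)
    num = np.linalg.solve(phi_n_loaded, d)
    w = num / (d.conj() @ num)  # w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)
    return w, d
```

The beamformer output for a frame `y` would then be `np.vdot(w, y)`, that is, w^H y. By construction the weights satisfy the distortionless constraint w^H d = 1 for the estimated steering vector. The second stage (single-channel SPP, noise PSD update, and log spectral amplitude post-filter) is not covered by this sketch.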

External IDs: doi:10.1109/taslpro.2025.3599782