Abstract: In multichannel speech enhancement (SE) systems, deep neural networks (DNNs) are often used to directly estimate the clean speech for effective beamforming. This approach, however, may not generalize adequately to unseen acoustic or noise conditions. Alternatively, DNNs can perform SE indirectly by predicting time-frequency masks of the speech and noise patterns to assist classic statistical beamformers. Although robust, this approach is constrained by the statistical component's reliance on certain modeling assumptions, e.g., covariance-based modeling in the minimum-variance-distortionless-response (MVDR) beamformer. In this paper, we propose a novel integration of the two methodologies by introducing an intra-MVDR module embedded in a U-Net beamformer, combining the merits of both: effectiveness and robustness. Experiments show that the intra-MVDR module yields improvements that are not achievable by simply enlarging the baseline SE network.
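For context on the covariance-based modeling the abstract refers to, the sketch below shows the standard narrowband MVDR weight computation for a single frequency bin. This is not the paper's proposed intra-MVDR module; it is a minimal illustration of the classical beamformer, and all concrete values (array size, covariance, steering vector) are hypothetical.

```python
import numpy as np

def mvdr_weights(R_n: np.ndarray, d: np.ndarray) -> np.ndarray:
    """MVDR weights for one frequency bin: minimize output noise power
    w^H R_n w subject to the distortionless constraint w^H d = 1, giving
        w = R_n^{-1} d / (d^H R_n^{-1} d)."""
    Rn_inv_d = np.linalg.solve(R_n, d)       # R_n^{-1} d without an explicit inverse
    return Rn_inv_d / (d.conj() @ Rn_inv_d)  # normalize so that w^H d = 1

# Toy example: 4-mic array with a synthetic noise covariance (hypothetical values).
rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_n = A @ A.conj().T + M * np.eye(M)          # Hermitian positive-definite noise covariance
d = np.exp(-1j * np.pi * 0.3 * np.arange(M))  # plane-wave steering vector

w = mvdr_weights(R_n, d)
print(abs(w.conj() @ d))  # distortionless response toward d: magnitude is 1.0
```

In mask-based pipelines, R_n is typically estimated by averaging the outer products of noisy observations weighted by a DNN-predicted noise mask, which is where the statistical assumptions the abstract mentions enter.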