Keywords: speech enhancement, semi-supervised learning, mean-teacher, gradient-guided channel attenuation
TL;DR: A semi-supervised speech enhancement network based on the mean-teacher framework.
Abstract: Recent speech enhancement (SE) methods generally adopt supervised learning and train models on synthetic pairs of noisy and clean speech. However, when such supervised SE models are applied to real-world recordings, which we call unlabeled data, their performance degrades. To improve the generalization of SE, we propose a semi-supervised monaural speech enhancement network, SS-SENet, which adopts the mean-teacher (MT) framework with domain adversarial (DA) learning to effectively exploit unlabeled data. We also propose a Gradient-Guided Channel Attenuation (GGCA) module, which suppresses domain-specific features and enhances domain-invariant ones, and a Domain Shift-Aware Monitor (DSAM) strategy, which dynamically adjusts the attenuation rate in GGCA. Compared with seven state-of-the-art (SOTA) methods that exploit unlabeled data, SS-SENet achieves the best performance on all metrics on both the synthetic Reverberant LibriCHiME-5 and LibriMix datasets, and on the critical metric, OVRL, on the real-world CHiME-5 dataset. These results show that our basic MT-based method outperforms the compared methods based on fully supervised or self-supervised learning, and they confirm the effectiveness of the proposed GGCA module and DSAM strategy. The source code is available at \url{https://anonymous.4open.science/r/SS-SENet}.
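In the generic mean-teacher (MT) framework named in the abstract, the teacher network is not trained by gradient descent; its weights track an exponential moving average (EMA) of the student's weights. The sketch below illustrates only that generic update rule, storing parameters as plain Python lists so it stays self-contained. The function name `ema_update`, the dictionary layout, and the decay value `alpha` are illustrative assumptions, not details taken from SS-SENet, and the sketch does not model the DA learning, GGCA, or DSAM components.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, alpha: float = 0.999) -> Params:
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student.

    This is the standard mean-teacher EMA rule (an assumption here, not
    SS-SENet's confirmed configuration). A larger alpha makes the teacher
    change more slowly and smooth out noisy student updates.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    updated: Params = {}
    for name, t_vals in teacher.items():
        s_vals = student[name]
        if len(s_vals) != len(t_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise moving average of teacher and student weights.
        updated[name] = [alpha * t + (1.0 - alpha) * s for t, s in zip(t_vals, s_vals)]
    return updated


if __name__ == "__main__":
    teacher = {"w": [0.0, 1.0]}
    student = {"w": [1.0, 1.0]}  # student held fixed for illustration
    for step in range(3):
        teacher = ema_update(teacher, student, alpha=0.5)
        print(step, teacher["w"])  # teacher["w"][0] moves 0.5, 0.75, 0.875
```

In a real training loop the student would take a gradient step on the supervised and consistency losses before each `ema_update` call, and `alpha` is typically set close to 1 so the teacher provides stable targets on unlabeled data.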
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16709