Multi-information-aware speech enhancement through self-supervised learning

Published: 2026 · Last Modified: 09 Nov 2025 · Digital Signal Processing, 2026 · License: CC BY-SA 4.0
Abstract: Speech enhancement is a crucial technology aimed at improving the quality and intelligibility of speech signals in noisy environments. Recent advances in deep neural networks have leveraged abundant clean speech datasets for supervised learning with remarkable results. However, supervised models suffer from poor robustness and generalization due to the scarcity of clean speech data and the complexity of real-world noise distributions. In this paper, a self-supervised speech enhancement model, called Multi-Information-Aware Speech Enhancement (MIA-SE), is proposed to address these challenges. A novel self-supervised training strategy is introduced in which denoising is performed on a single input twice, with the first denoiser output employed as an Implicit Deep Denoiser Prior (IDDP) to supervise the subsequent denoising pass. Furthermore, an encoder–decoder denoiser architecture based on a complex ratio masking strategy is incorporated to extract phase and magnitude features simultaneously. To capture sequence context for improved embeddings, transformer modules with multi-head attention are integrated within the denoiser. The training process is guided by a newly formulated loss function to ensure successful and effective learning. Experimental results on synthetic and real-world noise databases demonstrate the effectiveness of MIA-SE, particularly in scenarios where paired training data are unavailable.
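The two-pass idea in the abstract can be sketched in a few lines. The snippet below is a minimal toy illustration, not the paper's actual method: the real denoiser is an encoder–decoder network with complex ratio masking and transformer blocks, whereas here `toy_denoiser` is a hypothetical stand-in (a smoothing filter with random input masking so the two passes differ). The key structure it shows is that pass 1 produces the IDDP pseudo-target, and pass 2 is trained against it, so no clean reference is ever needed.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x, w, rng):
    # Hypothetical stand-in for MIA-SE's encoder-decoder denoiser.
    # Random input masking is an assumption added here purely so that
    # the two denoising passes produce different outputs.
    mask = rng.random(x.size) > 0.1
    return np.convolve(x * mask, w, mode="same")

# Noisy observation only -- no paired clean speech is available.
clean = np.sin(np.linspace(0.0, 4.0 * np.pi, 256))
noisy = clean + 0.3 * rng.standard_normal(256)

w = np.ones(5) / 5.0  # toy "denoiser parameters"

# Pass 1: the output serves as the Implicit Deep Denoiser Prior (IDDP).
iddp = toy_denoiser(noisy, w, rng)

# Pass 2: denoise the same input again; the IDDP (held fixed, i.e. no
# gradient would flow through it during training) is the pseudo-target.
out = toy_denoiser(noisy, w, rng)

# Self-supervised consistency loss between pass 2 and the IDDP.
loss = float(np.mean((out - iddp) ** 2))
```

In an actual training loop the parameters would be updated by gradients of this loss (plus whatever regularization the paper's formulated loss adds), with the IDDP detached from the graph; the toy filter here has no trainable step, it only mirrors the data flow.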