Keywords: Speech enhancement, semantic information, self-supervised learning, noise robustness
TL;DR: A unified framework combining noise-invariant representation learning and generative speech enhancement for improved content preservation
Abstract: The importance of semantic information in speech enhancement (SE) has recently been emphasized as a way to improve intelligibility, whereas earlier work focused primarily on acoustic perceptual quality. To address this, recent approaches leverage pre-trained self-supervised representations, which have shown strong performance on \emph{discriminative} tasks. However, such representations are less effective for \emph{generative} tasks and, since they are typically trained only on clean data, struggle to fully preserve content under noisy or distorted conditions.
In this work, we aim to bridge this gap by introducing a unified generative SE model, called \textbf{UNISE}, that incorporates noise-invariant representation learning. By jointly training an encoder via noise-invariant clustering and a generative decoder, our model produces robust speech representations well suited to the SE task.
As a result, UNISE achieves improved linguistic content preservation while maintaining competitive perceptual quality\footnote{Audio samples are available at: \url{https://tinyurl.com/UNISE-ICLR2026}}.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23439