Improving Speech Separation with Knowledge Distilled from Self-supervised Pre-trained Models

Bowen Qu, Chenda Li, Jinfeng Bai, Yanmin Qian

Published: 2022, Last Modified: 10 Nov 2025, ISCSLP 2022, CC BY-SA 4.0

Abstract: Large-scale self-supervised learning (SSL) models have shown outstanding ability in many speech processing tasks. Most SSL models in the literature are trained on datasets dominated by single-talker utterances, so applying them directly to speech separation may not be optimal. In addition, the high computational cost of large-scale SSL models increases the overall complexity of the speech separation system. In this paper, we explore the application of pre-trained SSL models to the speech separation task. Instead of using the SSL model directly, we design an SSL feature predictor that estimates single-talker deep features from the speech mixture. The SSL feature predictor is trained with knowledge distilled from the pre-trained Wav2Vec2.0 model. Our experiments show that the performance of time-domain speech separation improves noticeably when the SSL feature predictor is used.
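The distillation idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss function, the feature shapes, or how speakers are ordered, so everything below is an assumption made for illustration: the teacher features stand in for Wav2Vec2.0 outputs computed on each clean single-talker signal, the student features stand in for the predictor's outputs computed from the mixture, the loss is taken to be a per-speaker mean squared error, and the speaker order is assumed to be fixed. The names `Features`, `mse`, and `distillation_loss` are hypothetical.

```python
from typing import List

# One speaker's features: a list of frames, each a list of feature values.
Features = List[List[float]]


def mse(pred: Features, target: Features) -> float:
    """Mean squared error between two equally shaped feature matrices."""
    if len(pred) != len(target):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("feature dimensions differ")
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty feature matrix")
    return total / count


def distillation_loss(student: List[Features], teacher: List[Features]) -> float:
    """Average per-speaker MSE between student and teacher features.

    Assumption: student[i] and teacher[i] refer to the same speaker.
    The loss actually used in the paper may differ.
    """
    if len(student) != len(teacher) or not student:
        raise ValueError("student and teacher need the same non-zero speaker count")
    return sum(mse(s, t) for s, t in zip(student, teacher)) / len(student)


# Toy example: two speakers, two frames, two feature dimensions.
teacher_feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
student_feats = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
loss = distillation_loss(student_feats, teacher_feats)
```

In this toy setup only the predictor would be updated to reduce the loss, and the large teacher model would be needed only to produce training targets, which is consistent with the abstract's concern about the computational cost of using the SSL model directly.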