Abstract: Self-supervised learning (SSL) models have been found to be very effective for tasks such as automatic speech recognition and speaker identification. However, their utility in speech enhancement systems is yet to be firmly established, and is perhaps slightly misunderstood. In this paper, we investigate the use of SSL representations for single-channel speech enhancement in challenging conditions and establish the impact they can have on the enhancement task. Our constraints are designed around on-device, real-time speech enhancement: the model must be causal and its compute footprint small. Additionally, we focus on low-SNR conditions, where such models struggle to provide good performance.