Abstract: Deep neural networks (DNNs) have been used extensively to achieve impressive results in speech separation. Most DNN implementations for speech separation rely on supervised learning, which is data-hungry: success hinges on the availability of large-scale parallel clean-mixed speech pairs. Such data is often unavailable because it is difficult to create, which limits the applicability of supervised learning. Moreover, supervised speech separation requires systems to deal with the permutation problem (permutation ambiguity), which places an upper limit on the quality of separated speech a tool can attain. To avoid permutation ambiguity, some recent works have proposed speech separation based on clustering. However, these clustering techniques still rely on supervised learning and therefore still require high-quality paired data. To deal with permutation ambiguity and eliminate the need for a paired training dataset, we propose a fully unsupervised speech separation technique based on clustering of spectrogram points or raw speech blocks. Our technique couples traditional graph clustering objectives with deep neural networks to achieve speech separation. We first extract features of spectrogram points or raw speech blocks using a pre-trained model and then use these features in a downstream clustering task based on deep modularization. Through this, we identify clusters of spectrogram points or raw speech blocks dominated by each speaker in a speech mixture. We perform an extensive evaluation of the proposed technique and show that it outperforms the state-of-the-art tools included in the study.
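To make the "deep modularization" step concrete, below is a minimal sketch of one way to couple a graph clustering objective with a neural network: a small assignment head is trained to maximize the modularity of its soft cluster assignments over a similarity graph built from pre-trained features. Everything here is an illustrative assumption, not the paper's actual pipeline: the cosine-similarity graph construction, the two-layer head, and all sizes (`n_points`, `dim`, `n_speakers`) are hypothetical placeholders.

```python
import torch
import torch.nn as nn

def modularity_loss(assign, adj):
    """Negative soft modularity of assignments `assign` (N x K)
    on weighted adjacency `adj` (N x N); lower is better."""
    deg = adj.sum(dim=1, keepdim=True)        # node degrees, N x 1
    two_m = adj.sum()                         # twice the total edge weight
    expected = deg @ deg.t() / two_m          # null-model edge weights
    b = adj - expected                        # modularity matrix B
    q = torch.trace(assign.t() @ b @ assign) / two_m
    return -q                                 # minimize -Q to maximize Q

# Hypothetical inputs: `feats` stands in for embeddings of spectrogram
# points (or raw speech blocks) produced by a pre-trained model.
n_points, dim, n_speakers = 512, 128, 2
feats = torch.randn(n_points, dim)

# Build a similarity graph over the features (one possible choice:
# thresholded cosine similarity with self-loops removed).
sim = torch.relu(nn.functional.cosine_similarity(
    feats.unsqueeze(1), feats.unsqueeze(0), dim=-1))
adj = sim * (1 - torch.eye(n_points))

# Small network mapping features to soft cluster assignments.
head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                     nn.Linear(64, n_speakers), nn.Softmax(dim=-1))
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

for step in range(200):
    assign = head(feats)                      # soft assignments, N x K
    loss = modularity_loss(assign, adj)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Hard labels: each cluster gathers points dominated by one speaker.
labels = head(feats).argmax(dim=-1)
```

In this sketch the clustering is fully unsupervised: no clean reference signals are used, and since clusters carry no fixed order, the permutation problem of supervised training never arises.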
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Brian_Kingsbury1
Submission Number: 2463