Which pre-trained model is effective for speech separation?

17 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Pre-trained model, speech separation, modularization
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: The effectiveness of general audio pre-trained models in generating representations suitable for speech separation was explored in a previous study (Huang et al., 2022), with the main finding being that they provide minimal benefit compared to features extracted without such models. That study hypothesised that because the general audio pre-trained models were trained on clean audio datasets, they are unable to generalize to noisy and mixed speech and are therefore not effective for speech separation. This paper investigates this hypothesis by comparing the performance of a pre-trained model trained on contaminated speech with that of one trained on clean speech. We are interested in evaluating whether contamination leads to better downstream performance. We also investigate whether the type of input used to train the pre-trained model affects the quality of the embeddings it generates. To separate the sources, we propose a fully unsupervised speech separation technique based on deep modularization. Our findings establish that injecting noise and reverberation into the training dataset leads the pre-trained model to generate significantly better embeddings than when a clean dataset is used. Further, for the model presented here, working in the short-time Fourier transform (STFT) domain yields better features than using time-domain features. The proposed deep modularization speech separation technique improves SI-SNRi and SDRi by 1.3 dB and 2.7 dB, respectively, when mixtures contain fewer than four sources, and improves results significantly for mixtures with many sources.
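For readers unfamiliar with the metrics reported in the abstract, the sketch below shows how SI-SNR and its improvement over the unprocessed mixture (SI-SNRi) are conventionally computed. This is a minimal NumPy illustration of the standard definition, not code from the submission; the function names and the synthetic two-source example are ours.

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB."""
    # Remove DC offset so the measure is offset-invariant.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: the scaled-target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    s_target = scale * reference
    e_noise = estimate - s_target
    return 10 * np.log10((np.sum(s_target**2) + eps) / (np.sum(e_noise**2) + eps))

def si_snr_improvement(estimate: np.ndarray, reference: np.ndarray,
                       mixture: np.ndarray) -> float:
    """SI-SNRi: gain of the separated estimate over using the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)

# Illustrative usage on a synthetic two-source mixture.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)          # hypothetical source 1
s2 = rng.standard_normal(16000)          # hypothetical source 2
mix = s1 + s2                            # the observed mixture
est = s1 + 0.1 * rng.standard_normal(16000)  # a hypothetical separated output
print(f"SI-SNRi: {si_snr_improvement(est, s1, mix):.2f} dB")
```

SDRi is defined analogously as the improvement in signal-to-distortion ratio over the mixture baseline; both quantities are reported in dB.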
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 908