Audio-only Target Speech Extraction with Dual-Path Band-Split RNN

XJTU 2024 CSUC Submission9 Authors

31 Mar 2024 (modified: 14 Jun 2024), XJTU 2024 CSUC Withdrawn Submission, Readers: Everyone, CC BY 4.0
Keywords: target speech extraction, dual-path mechanism, information disentanglement, multi-task learning
Abstract: Although a growing number of deep learning models for target speech extraction (TSE) have achieved better performance on certain datasets by continuously refining modules and experimenting with new algorithms, they remain constrained by generic frameworks and have neither proposed new task decomposition mechanisms nor exploited new information. In this paper, we propose a novel model architecture that extracts the target speaker and suppresses interfering noise simultaneously, acknowledging the intrinsic similarity of these two tasks. The model is divided into two branches, one for extracting the target speaker's speech and the other for estimating the speech of the interferer, thereby guiding the shallow latent features to learn the essence of the TSE task. Additionally, we adopt a mechanism similar to self-enrollment, in which the latent features of the two branches are cross-fused at each stage of the extraction process, in order to further leverage the results obtained by both branches.
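As a rough illustration of the two-branch, stage-wise cross-fusion idea described in the abstract, the following toy sketch may help. It is not the submission's implementation: the abstract does not specify the fusion operator, the per-stage blocks, or how the enrollment signal is injected, so the additive fusion, the `alpha` weight, the plain linear stages, and every function name here are assumptions made purely for illustration.

```python
import random


def linear(x, w):
    """Apply a square weight matrix w (list of rows) to feature vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def make_weights(dim, rng):
    """Small random square matrix standing in for a learned stage block."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def cross_fuse(target, interferer, alpha=0.5):
    """Hypothetical fusion: each branch adds a scaled copy of the other's features.

    The abstract only says the branches are cross-fused; additive fusion
    with weight alpha is an assumption of this sketch.
    """
    fused_t = [t + alpha * i for t, i in zip(target, interferer)]
    fused_i = [i + alpha * t for t, i in zip(target, interferer)]
    return fused_t, fused_i


def dual_branch_forward(mixture_feat, stages):
    """Run both branches stage by stage, exchanging features after each stage."""
    target = list(mixture_feat)      # branch estimating the target speaker
    interferer = list(mixture_feat)  # branch estimating the interferer
    for w_t, w_i in stages:
        target = linear(target, w_t)
        interferer = linear(interferer, w_i)
        target, interferer = cross_fuse(target, interferer)
    return target, interferer


rng = random.Random(0)
dim, num_stages = 4, 3
stages = [(make_weights(dim, rng), make_weights(dim, rng)) for _ in range(num_stages)]
target_feat, interferer_feat = dual_branch_forward([1.0, 0.5, -0.5, 0.0], stages)
```

In a multi-task setup such as the one the keywords suggest, both outputs would presumably each receive their own training loss; that part, along with the speaker enrollment input, is left out of the sketch.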
Submission Number: 9