Enhancing Speaker Extraction Through Rectifying Target Confusion

Published: 2024, Last Modified: 13 Nov 2025, SLT 2024, CC BY-SA 4.0
Abstract: Target Speaker Extraction (TSE) aims to extract target speech from mixed audio using clues that identify the target speaker. However, TSE often suffers from the Target Confusion (TC) problem, in which the model extracts the interfering speech instead of the target speech, leading to significant performance degradation. In this paper, we propose a novel two-branch model that enhances target speech extraction by explicitly modeling the interference. Additionally, we propose a Target Confusion Rectification (TCR) method to address the TC problem. When the TSE model outputs the wrong speaker, the TCR method performs a rectifying step to ensure the model extracts the correct speaker. Experiments using the train-100 subset of the Libri2Mix dataset show that our proposed method significantly improves extraction performance in terms of SI-SNRi, PESQ score, and extraction accuracy, with results on the ‘mix_clean’ subset slightly better than those on the ‘mix_both’ subset.
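The abstract reports results in SI-SNRi and extraction accuracy without spelling out how they are computed. As a minimal sketch, the block below uses the commonly used definition of scale-invariant SNR (zero-mean signals, projection of the estimate onto the reference) and takes SI-SNRi as the gain over the unprocessed mixture. The function names are hypothetical, and `is_target_confused` is only an assumed evaluation-time check that needs ground-truth signals; it is not the paper's TCR method.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean so that a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to obtain the scaled target part.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target_part = [scale * r for r in ref]
    # Everything in the estimate that is not explained by the reference.
    noise_part = [e - t for e, t in zip(est, target_part)]
    target_energy = sum(t * t for t in target_part)
    noise_energy = sum(v * v for v in noise_part)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


def is_target_confused(estimate, target, interferer):
    """Assumed check: the output is closer to the interferer than to the target."""
    return si_snr(estimate, interferer) > si_snr(estimate, target)
```

Under this assumption, extraction accuracy could be scored as the fraction of test utterances for which `is_target_confused` returns `False`; the paper may define it differently.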