Abstract: Target speaker extraction (TSE) aims to directly extract the desired speech given enrollment utterances of the target speaker. Despite significant progress in recent years, most existing methods remain non-causal and computationally intensive. This paper introduces DSINet, a real-time time-frequency (T-F) domain method that uses a dynamic speaker information fusion mechanism to estimate the real and imaginary (RI) components of the target speech. The method uses T-F band-split modeling as its primary speaker extractor. Moreover, instead of explicitly computing a target speaker embedding, the fusion mechanism makes efficient use of target speaker information within each mixture, guiding the backbone extractor toward the desired speech. Experimental results on the WSJ0-2mix and WHAMR! datasets confirm that the proposed method scales well and achieves performance comparable to prominent non-causal methods across different model sizes.
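As a rough illustration of what band-split modeling on RI components can look like, the following minimal sketch splits one STFT frame into consecutive frequency sub-bands and stacks the real and imaginary parts of each sub-band into a feature vector. It is not DSINet's implementation: the function names, the 6-bin frame, and the band widths `[1, 2, 3]` are all hypothetical and chosen only to keep the example small.

```python
from typing import List, Tuple


def band_edges(num_bins: int, band_widths: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) bin indices for consecutive, non-overlapping sub-bands."""
    if sum(band_widths) != num_bins:
        raise ValueError("band widths must cover every frequency bin exactly once")
    edges = []
    start = 0
    for width in band_widths:
        edges.append((start, start + width))
        start += width
    return edges


def split_frame_to_ri_bands(
    frame: List[complex], band_widths: List[int]
) -> List[List[float]]:
    """Split one STFT frame into sub-bands of stacked [real..., imag...] features."""
    bands = []
    for lo, hi in band_edges(len(frame), band_widths):
        chunk = frame[lo:hi]
        # Real parts first, then imaginary parts, for this sub-band only.
        bands.append([z.real for z in chunk] + [z.imag for z in chunk])
    return bands


# Hypothetical 6-bin complex frame (assumption, for illustration only).
frame = [1 + 2j, 3 + 4j, 5 + 6j, 7 + 8j, 9 + 10j, 11 + 12j]
bands = split_frame_to_ri_bands(frame, [1, 2, 3])
```

In a band-split model, each sub-band feature vector would typically be projected to a common width before sequence modeling. Because the split above operates frame by frame, it needs no future context, which is compatible with causal (real-time) processing.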