EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

Published: 01 Jan 2022, Last Modified: 08 May 2023, Venue: SLT 2022, Readers: Everyone
Abstract: In this paper, we present a novel framework that jointly performs three tasks: speaker diarization, speech separation, and speaker counting. Our proposed framework integrates speaker diarization based on end-to-end neural diarization (EEND) models, speaker counting with encoder-decoder based attractors (EDA), and speech separation using Conv-TasNet. In addition, we propose a multiple $1 \times 1$ convolutional layer architecture for estimating the separation masks corresponding to a flexible number of speakers, and a fusion technique for refining the separated speech signal with the obtained speaker diarization information to improve the joint framework. Experiments using the LibriMix dataset show that our proposed method outperforms the single-task baselines in both diarization and separation metrics for fixed and flexible numbers of speakers, and improves speaker counting performance for flexible numbers of speakers. All materials will be open-sourced and reproducible in the ESPnet toolkit (https://github.com/espnet/espnet).
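The two architectural ideas named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes one $1 \times 1$ convolution per supported speaker count, selected by an externally supplied count (standing in for the EDA estimate), and a simple multiplicative fusion with per-frame speech activity. The names `MultiMaskHead` and `fuse_with_diarization`, the sigmoid mask activation, and the random weights are all hypothetical choices made for illustration. For brevity the fusion is applied to the masks rather than to separated waveforms.

```python
import math
import random


def make_pointwise_conv(in_ch, out_ch, rng):
    """A 1x1 convolution is a single weight matrix shared across all frames."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_ch)] for _ in range(out_ch)]


def apply_pointwise_conv(weight, feats):
    """feats: T frames of in_ch values -> T frames of out_ch values."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in feats]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class MultiMaskHead:
    """Hypothetical bank of 1x1 conv layers, one per supported speaker count."""

    def __init__(self, in_ch, mask_ch, max_speakers, seed=0):
        rng = random.Random(seed)
        self.mask_ch = mask_ch
        # The layer for c speakers emits c * mask_ch channels (assumption).
        self.heads = {c: make_pointwise_conv(in_ch, c * mask_ch, rng)
                      for c in range(1, max_speakers + 1)}

    def __call__(self, feats, num_speakers):
        # num_speakers stands in for the count estimated by the diarization branch.
        out = apply_pointwise_conv(self.heads[num_speakers], feats)
        n = self.mask_ch
        # Split channels into per-speaker masks, indexed masks[speaker][frame][channel].
        return [[[sigmoid(v) for v in frame[s * n:(s + 1) * n]] for frame in out]
                for s in range(num_speakers)]


def fuse_with_diarization(per_speaker_frames, activity):
    """Assumed multiplicative fusion: scale each frame by that speaker's
    speech-activity probability, activity[speaker][frame]."""
    return [[[p * v for v in frame] for frame, p in zip(frames, probs)]
            for frames, probs in zip(per_speaker_frames, activity)]


# Toy input: T=3 frames with 4 feature channels each.
feats = [[0.5, -1.0, 2.0, 0.0] for _ in range(3)]
head = MultiMaskHead(in_ch=4, mask_ch=2, max_speakers=3)
masks = head(feats, num_speakers=2)
# Toy diarization output: speaker 0 silent in the last frame, speaker 1 in the first.
activity = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
refined = fuse_with_diarization(masks, activity)
```

Keeping a separate output layer per speaker count lets the mask estimator share its encoder features while still producing exactly as many masks as the counting branch predicts. The fusion step then suppresses each separated stream in frames where the diarization branch reports that speaker as inactive.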