Multi-channel Speaker Counting for EEND-VC-based Speaker Diarization on Multi-domain Conversation

Published: 2025, Last Modified: 02 Feb 2026, Venue: ICASSP 2025, License: CC BY-SA 4.0
Abstract: This paper proposes a speaker counting scheme that uses multichannel microphones in an end-to-end neural diarization with vector clustering (EEND-VC) speaker diarization pipeline. An EEND-VC-based system estimates the number of speakers by clustering speaker embeddings extracted from small chunks. However, conventional speaker counting struggles in short sessions, where few embeddings are available. We address this issue by drawing as many embeddings as possible from the multichannel signals, which increases the number of embeddings available for clustering. One challenge in using embeddings across channels is the bias caused by channel differences. To mitigate this issue, we extend the EEND-VC pipeline with two modifications: (1) applying speech enhancement before extracting speaker embeddings, so that speaker characteristics are captured even from short chunks, and (2) grouping microphones based on inter-channel correlation, performing speaker counting within each group, and then aggregating the group-wise results. The proposed scheme was integrated into our CHiME-8 diarization pipeline and achieved higher speaker counting accuracy than the CHiME-8 baseline, with improvements of 54.2% and 61.4% on the development and evaluation sets, respectively.
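The second modification (group microphones by inter-channel correlation, count speakers per group, then aggregate) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the greedy grouping rule, the thresholds, the leader-style clustering used as a stand-in for the EEND-VC clustering step, and the majority vote used for aggregation. Speech enhancement and embedding extraction are not modelled; embeddings are taken as given.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def group_channels(signals, corr_threshold=0.8):
    """Greedy grouping (assumed rule): a channel joins the first group whose
    first member it correlates with at or above the threshold; otherwise it
    starts a new group."""
    groups = []
    for ch, sig in enumerate(signals):
        for group in groups:
            if pearson(signals[group[0]], sig) >= corr_threshold:
                group.append(ch)
                break
        else:
            groups.append([ch])
    return groups


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def count_speakers(embeddings, sim_threshold=0.7):
    """Leader clustering (a stand-in, not the paper's clustering): an embedding
    opens a new cluster if it is not similar enough to any existing leader.
    The number of leaders is the speaker count estimate."""
    leaders = []
    for emb in embeddings:
        if not any(cosine(emb, lead) >= sim_threshold for lead in leaders):
            leaders.append(emb)
    return len(leaders)


def aggregate_counts(counts):
    """Majority vote over group-wise counts (assumed rule); ties go to the
    smaller count."""
    freq = Counter(counts)
    best = max(freq.values())
    return min(c for c, f in freq.items() if f == best)


def multichannel_speaker_count(signals, embeddings_per_channel,
                               corr_threshold=0.8, sim_threshold=0.7):
    """Pool embeddings within each channel group, count per group, aggregate."""
    groups = group_channels(signals, corr_threshold)
    counts = []
    for group in groups:
        pooled = [e for ch in group for e in embeddings_per_channel[ch]]
        counts.append(count_speakers(pooled, sim_threshold))
    return aggregate_counts(counts), groups, counts


# Toy input: channels 0 and 1 are scaled copies, channel 2 is uncorrelated.
signals = [
    [0, 1, 0, -1, 0, 1, 0, -1],
    [0, 2, 0, -2, 0, 2, 0, -2],
    [1, 0, -1, 0, 1, 0, -1, 0],
]
embeddings_per_channel = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.05]],
]
estimate, groups, counts = multichannel_speaker_count(signals, embeddings_per_channel)
print(estimate, groups, counts)
```

Pooling embeddings only within a correlated group is meant to mirror the motivation stated in the abstract: it raises the number of embeddings per clustering run while keeping channels with very different characteristics apart.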