Speaker Diarization in Multispeaker and Multilingual Scenarios

Published: 01 Jan 2024, Last Modified: 02 Aug 2025, Venue: ICCCNT 2024, License: CC BY-SA 4.0
Abstract: Speaker diarization is the task in automatic speech processing of segmenting an audio recording and labeling each segment by speaker, so that the distinct speakers in the audio can be identified and told apart. This work implements a comprehensive approach to multilingual speaker diarization, which is crucial for understanding and processing speech across diverse languages. The Pyannote library was used to develop a custom speaker diarization model and fine-tune it on an in-house multilingual conversation dataset. The PyanNet segmentation model was used to segment the audio recordings and identify speech regions. Speaker embeddings are extracted with the ECAPA-TDNN architecture, pretrained on VoxCeleb data and fine-tuned on a multilingual dataset featuring six languages and 100 audio samples, split into training, development, and test sets for evaluation. Agglomerative clustering was employed to group segments by speaker, and the system's output was compared against manually created Rich Transcription Time Marked (RTTM) files containing ground-truth speaker segments and labels, across the ECAPA-TDNN, X-vector MFCC, and X-vector SincNet embedding models. This study advances the understanding of multilingual speaker diarization and offers insights into its potential applications in transcription, voice biometrics, and multilingual voice assistants.
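The clustering and RTTM steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional toy vectors stand in for ECAPA-TDNN embeddings, and the average linkage, cosine distance, 0.3 stopping threshold, and the function names `agglomerative_cluster` and `to_rttm` are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def agglomerative_cluster(embeddings, threshold):
    """Average-linkage agglomerative clustering (assumed linkage and metric).

    Starts with one cluster per segment and repeatedly merges the closest
    pair until the closest pair is farther apart than `threshold`.
    Returns one integer speaker label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def to_rttm(file_id, segments, labels):
    """Format (start, end) segments and labels as RTTM SPEAKER lines."""
    return [
        f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} "
        f"<NA> <NA> spk{label} <NA> <NA>"
        for (start, end), label in zip(segments, labels)
    ]


# Toy input: four speech segments with hypothetical 2-D embeddings.
segments = [(0.0, 2.5), (2.5, 4.0), (4.0, 7.0), (7.0, 8.5)]
embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]

labels = agglomerative_cluster(embeddings, threshold=0.3)
rttm_lines = to_rttm("toy_recording", segments, labels)
print("\n".join(rttm_lines))
```

In an evaluation like the one described, lines in this format would be compared against the manually created reference RTTM; the stopping threshold is the kind of parameter that would typically be tuned on a development set.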