Speaker Diarization for Unlimited Number of Speakers Using Dynamic Linear

Published: 2024, Last Modified: 08 Jan 2026, Venue: ISCSLP 2024, License: CC BY-SA 4.0
Abstract: Speaker diarization is the task of determining “who spoke when” in an audio recording, and is of practical importance in multi-talker scenarios such as patient inquiries and meetings. Current speaker diarization methods are typically supervised clustering algorithms based on artificial neural networks, and because their output layer has a fixed size they usually cannot handle an unlimited number of speakers. In this paper, we propose Dynamic Linear, an output layer that can automatically add output nodes and generate new vectors. By replacing the output layer of Discriminative Neural Clustering (DNC) with Dynamic Linear, we obtain a Transformer-based speaker diarization method that can handle an unlimited number of speakers. We also modify the training strategy, loss function, and decoding method to suit the proposed method. Experiments show that the proposed method not only enables DNC to handle an unlimited number of speakers, but also achieves a 1.83% performance improvement on the AMI dataset relative to the DNC baseline. In addition, Dynamic Linear can substitute for the output linear layer in any neural clustering algorithm, enabling it to deal with an unlimited number of clusters and bringing it closer to general clustering algorithms.
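The abstract does not specify how Dynamic Linear adds nodes, so the following is only a minimal sketch of the general idea of an output layer that grows on demand, not the authors' implementation. It assumes a hypothetical convention in which the last output node acts as a "new speaker" slot: when that slot wins during greedy decoding, it is claimed and a fresh slot is appended. The names `DynamicLinear`, `add_node`, and `assign_cluster`, the random initialisation, and the use of plain Python lists are all illustrative assumptions.

```python
import random


class DynamicLinear:
    """Toy linear output layer whose number of output nodes can grow.

    Illustrative assumption only; the paper's layer may generate new
    weight vectors in a different way.
    """

    def __init__(self, in_features, out_features=1, seed=0):
        self.in_features = in_features
        self._rng = random.Random(seed)
        self.weights = []  # one weight row per output node
        self.biases = []   # one bias per output node
        for _ in range(out_features):
            self.add_node()

    def add_node(self):
        """Append one output node with small random weights; return its index."""
        scale = self.in_features ** -0.5
        row = [self._rng.uniform(-scale, scale) for _ in range(self.in_features)]
        self.weights.append(row)
        self.biases.append(0.0)
        return len(self.weights) - 1

    def forward(self, x):
        """Return one logit per currently existing output node."""
        if len(x) != self.in_features:
            raise ValueError("input size does not match in_features")
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weights, self.biases)
        ]


def assign_cluster(layer, x):
    """Greedy decoding step under the assumed 'last node = new speaker' rule.

    If the last node has the highest logit, it becomes a real cluster and a
    new empty slot is appended, so the label set is never capped.
    """
    logits = layer.forward(x)
    best = max(range(len(logits)), key=logits.__getitem__)
    if best == len(logits) - 1:
        layer.add_node()
    return best
```

In this sketch, existing weight rows are left untouched when a node is added, so earlier cluster assignments keep their meaning while the number of possible labels grows with the input.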