Abstract: Multi-speaker diarization involves detecting the number of speakers in an audio recording and segregating the audio segments corresponding to each speaker. Despite tremendous advances in deep learning, multi-speaker diarization remains far from acceptable performance. In this work, we address the problem by first obtaining timestamps using voice activity detection and sliding-window techniques. We then extract Mel-spectrograms / Mel-frequency cepstral coefficients (MFCCs) and train a Long Short-Term Memory (LSTM) network to produce audio embeddings known as d-vectors. Subsequently, we apply K-Means and spectral clustering to segment all the speakers in a given audio file. We evaluate the proposed framework on the publicly available VoxConverse dataset and report results against comparable benchmarks from the existing literature. Despite its simpler design, the proposed model performs better than, or on par with, existing techniques.
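To make the described pipeline concrete, below is a minimal sketch in Python of the stages named in the abstract: windowed segmentation of speech regions, MFCC extraction, LSTM d-vector embedding, and clustering. Everything here is an illustrative assumption rather than the authors' reported configuration: the window and hop lengths, MFCC count, LSTM dimensions, number of speakers, and the crude energy-based stand-in for the voice activity detector are all hypothetical, and a real system would load trained d-vector weights.

```python
# Hypothetical sketch of the diarization pipeline described in the abstract.
# All hyperparameters are illustrative assumptions, not the paper's settings.
import numpy as np
import librosa
import torch
import torch.nn as nn
from sklearn.cluster import KMeans, SpectralClustering


class DVectorLSTM(nn.Module):
    """LSTM encoder mapping a window of MFCC frames to a unit-norm d-vector."""

    def __init__(self, n_mfcc=40, hidden=256, d_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, d_dim)

    def forward(self, x):                       # x: (batch, frames, n_mfcc)
        out, _ = self.lstm(x)
        d = self.proj(out[:, -1])               # last hidden state -> d-vector
        return torch.nn.functional.normalize(d, dim=1)


def diarize(wav_path, n_speakers=2, win_s=1.0, hop_s=0.5, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # Energy-based stand-in for VAD (the paper uses a dedicated detector).
    intervals = librosa.effects.split(y, top_db=30)
    model = DVectorLSTM().eval()                # assume trained weights are loaded
    embeds, stamps = [], []
    win, hop = int(win_s * sr), int(hop_s * sr)
    for start, end in intervals:
        # Slide a fixed-length window over each speech region.
        for s in range(start, max(start + 1, end - win + 1), hop):
            seg = y[s:s + win]
            if len(seg) < win:
                break
            mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=40).T  # (frames, 40)
            with torch.no_grad():
                d = model(torch.tensor(mfcc, dtype=torch.float32).unsqueeze(0))
            embeds.append(d.squeeze(0).numpy())
            stamps.append((s / sr, (s + win) / sr))
    X = np.stack(embeds)
    # Either clustering backend from the abstract can be swapped in here.
    labels = SpectralClustering(n_clusters=n_speakers,
                                affinity="nearest_neighbors").fit_predict(X)
    # labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(X)
    return list(zip(stamps, labels))           # [((t_start, t_end), speaker_id)]
```

The sketch returns per-window speaker labels with timestamps; merging adjacent windows that share a label would yield contiguous speaker turns of the kind scored by diarization error rate on VoxConverse.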