Abstract: Speech recognition has advanced rapidly in recent years owing to the development of powerful deep learning architectures. However, the closely related tasks of speech segmentation and diarization are still dominated by sophisticated variants of hierarchical clustering algorithms. We propose an adaptation of state-of-the-art speech recognition models for these tasks and demonstrate the effectiveness of our techniques on standard datasets. Our architectures combine bidirectional Long Short-Term Memory (LSTM) networks, convolutional networks, and fully connected networks, trained by gradient descent to minimize the cross-entropy and Connectionist Temporal Classification (CTC) losses. We adapt the LibriSpeech corpus for the tasks of segmentation and diarization. We obtain results comparable to the state of the art on both tasks.