Spoken Language Recognition Using CNN

Published: 01 Jan 2019, Last Modified: 15 May 2025ICIT 2019EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: This paper studies the use of Convolutional Neural Networks (CNNs) to address the problem of spoken language recognition. The solution uses Convolution Neural Networks in order to detect language specific phonemes. It supports 3 languages: German, English and Spanish. LibriVox Recordings were used to prepare the data set. The samples are equally balanced between languages, genders and speakers not to favour any subgroup. Attention has been paid to concentrate more on language properties than any specific voice. The data set has been boosted with data augmentation with transformations such as pitch, speed and noise. The input audio is normalized after which the filter banks are extracted. The pre-processed data is passed to the convolution neural network. The trained model generalizes well which was confirmed against real life content. The fact that audio samples were perfectly distributed among different languages and genders and the fact that speakers present in training set were not in validation set were the prominent reasons to achieve such a high performance on such a simple architecture.
Loading