A Good Start is Half the Battle Won: Unsupervised Pre-training for Low Resource Children's Speech Recognition for an Interactive Reading Companion

Published: 01 Jan 2021 · Last Modified: 22 Jul 2024 · AIED (1) 2021 · CC BY-SA 4.0
Abstract: Children’s speech recognition is a challenging task because of the inherent speech production characteristics of children’s articulatory structure as well as their linguistic usage. In the context of developing automated reading companions, the problem is compounded by a lack of training data. Most of the available data is recorded under clean and controlled conditions, leading to performance degradation in the presence of uncontrolled, realistic acoustic environments. In this study, we address these challenges by leveraging a publicly available large unlabeled read-speech corpus to learn generalized audio representations. These learned representations are then employed to augment the features used for training the acoustic model on limited in-domain children’s speech. The representations are learned via a deep convolutional architecture optimized on a noise-contrastive binary classification task that distinguishes a true future audio sample from negatives. We obtain up to 24.87% relative improvement in the Word Error Rate (WER) of our speech recognition system using these generalized audio embeddings, demonstrating the effectiveness of pre-training when training data is limited.
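The noise-contrastive objective described in the abstract can be illustrated with a minimal sketch. The function below is a hypothetical simplification (not the paper's implementation): it scores a true future embedding and several negative embeddings against a context vector via dot products, then applies a binary logistic loss so that the true future is pushed toward high scores and negatives toward low scores.

```python
import numpy as np

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def nce_binary_loss(context, true_future, negatives):
    """Hypothetical noise-contrastive binary classification loss.

    context:      (d,) context embedding produced by the encoder
    true_future:  (d,) embedding of the actual next audio sample
    negatives:    (k, d) embeddings of k negative (distractor) samples

    The true future should receive a high similarity score and each
    negative a low one; both are penalized with a logistic loss.
    """
    pos_logit = context @ true_future          # similarity to the true future
    neg_logits = negatives @ context           # similarities to the negatives
    loss = -np.log(sigmoid(pos_logit))                 # true sample: label 1
    loss += -np.sum(np.log(sigmoid(-neg_logits)))      # negatives: label 0
    return loss / (1 + len(negatives))                 # mean over all samples
```

Minimizing this loss over many (context, future, negatives) triples trains the encoder to produce embeddings that predict upcoming audio, which is what makes them useful as pre-trained features for the downstream acoustic model.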