Improving Generalization of Norwegian ASR with Limited Linguistic Resources

Published: 20 Mar 2023 · Last Modified: 17 Apr 2023 · NoDaLiDa 2023
Keywords: automatic speech recognition, end-to-end, low resource, wav2vec, Whisper, Norwegian, dialect
TL;DR: Using more varied speech data improves the generalization of fine-tuned wav2vec2 ASR models without increasing the size of the training set.
Abstract: With large amounts of training data, it is possible to train ASR models that generalize well across speakers and domains. But how do you train robust models when the amount of available training data is limited? In the experiments reported here, we fine-tuned a pre-trained wav2vec2 ASR model on two transcribed Norwegian speech datasets, one with parliamentary speech and one with radio recordings, as well as on combinations of the two datasets. We subsequently tested these models on different test sets with planned and unplanned speech and with speakers of various dialects. Our results show that models trained on combinations of the two datasets generalize better to new data than the single-dataset models, even when the total duration of the training data is the same. Our lexical analysis sheds light on the type of mistakes made by the models and on the importance of consistent standardization when training combined models of this kind.
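The combined-dataset setup the abstract describes can be sketched with the Hugging Face Transformers and Datasets APIs. Everything below is an illustrative assumption rather than the authors' actual pipeline: the checkpoint name, the local dataset paths, the "text" transcript column, and the subsampling step that keeps total duration roughly fixed are all placeholders.

```python
# A minimal sketch (not the authors' code) of fine-tuning a pre-trained
# wav2vec2 model on a combination of two transcribed speech corpora
# while keeping the total amount of training data fixed.
from datasets import Audio, concatenate_datasets, load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed Norwegian wav2vec2 checkpoint; the paper's base model may differ.
BASE = "NbAiLab/nb-wav2vec2-300m-bokmaal"

# Hypothetical local copies of the two corpora (parliamentary speech and
# radio recordings), each with an "audio" column and a "text" transcript.
parliament = load_dataset("audiofolder", data_dir="data/parliament", split="train")
radio = load_dataset("audiofolder", data_dir="data/radio", split="train")

# Take half of each corpus before combining, so the combined training set
# has (roughly) the same duration as either single-dataset baseline.
half = min(len(parliament), len(radio)) // 2
combined = concatenate_datasets(
    [parliament.shuffle(seed=42).select(range(half)),
     radio.shuffle(seed=42).select(range(half))]
).cast_column("audio", Audio(sampling_rate=16_000))

processor = Wav2Vec2Processor.from_pretrained(BASE)
model = Wav2Vec2ForCTC.from_pretrained(BASE)

def prepare(batch):
    # Raw waveform -> model inputs; transcript -> CTC label ids.
    audio = batch["audio"]
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    batch["labels"] = processor(text=batch["text"]).input_ids
    return batch

combined = combined.map(prepare, remove_columns=combined.column_names)
# From here, training follows the standard wav2vec2 CTC recipe:
# a padding data collator plus transformers.Trainer.
```

Subsampling before concatenation is what makes the same-duration comparison meaningful: any gain of the combined model over the single-corpus baselines can then be attributed to the variety of the data rather than its volume.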