Abstract: Despite recent advances in automatic speech recognition (ASR) performance on common languages, a large fraction of the world’s languages remain unsupported. Parameter-efficient fine-tuning (PEFT) methods adapt such models to unseen languages by inserting language-specific modules into the models. To further improve adaptation performance, an ensemble of PEFT models can be formed, where the outputs of the ensemble members are aggregated to produce the final prediction, and increasing the diversity of these outputs has been shown to improve results. However, PEFT model ensembles have rarely been studied in the context of ASR, despite their advantage of requiring significantly less memory for model storage, and the effect of combining diverse PEFT methods on the diversity of ensemble outputs remains unexplored. Specifically, it is unclear whether training with different PEFT methods improves diversity more than using the same PEFT method with different random seeds. To answer this, we examine whether a better model ensemble can be formed by combining models adapted with different PEFT methods rather than a single PEFT method. When adapting Whisper to 10 hours of data for each of three unseen languages from Common Voice, our ensemble with diverse PEFT methods consistently outperforms ensembles that use a single PEFT method. Moreover, compared with the common approach of forming ensembles from fully fine-tuned models, our diverse PEFT ensemble reduces the Word Error Rate from 8.4% to 7.9% while requiring about five times less memory for model storage.