Adopting Domain-Specific Knowledge in ASR Systems

Published: 09 Mar 2025 · Last Modified: 11 Mar 2025 · MathAI 2025 Oral · CC BY 4.0
Keywords: Multilingual, Automatic speech recognition, International Phonetic Alphabet, Hierarchical multi-task learning, Few-Shot, Language transfer, Domain shifts
TL;DR: This study improves multilingual ASR in the IPA format using linguistic knowledge, hierarchical multi-task learning, and language vectors, improving accuracy by 7-10% overall and by over 20% for out-of-domain languages with limited data.
Abstract: This study addresses the challenge of enhancing the accuracy and robustness of multilingual automatic speech recognition (ASR) models in the International Phonetic Alphabet (IPA) format. The primary obstacles are linguistic diversity, pronunciation variability, and the scarcity of high-quality annotated data for many languages, which impedes model generalization to unseen languages. To tackle this issue, we propose a novel approach that integrates prior linguistic knowledge into the training process and incorporates auxiliary information into the model architecture via a hierarchical multi-task learning approach. The proposed method decomposes the phoneme recognition process into multiple levels of abstraction, enabling the model to better generalize across diverse phonetic systems. Furthermore, we introduce two variants of language vector representations: one derived from acoustic signals and the other from phonetic transcriptions. These representations serve as auxiliary information and are particularly beneficial in few-shot recognition scenarios. We evaluated the approach on datasets that include both high-resource and low-resource languages, employing the pre-trained Wav2vec 2.0 transformer model as the base architecture. As a baseline, the model was fine-tuned solely on the primary task using Connectionist Temporal Classification (CTC) loss, without leveraging auxiliary information. Performance was assessed using Phoneme Error Rate (PER) in both in-domain and out-of-domain scenarios. Experimental results demonstrate that the proposed approach achieves a relative improvement of 7-10% in recognition accuracy across most scenarios. Notably, we observed an improvement of over 20% for out-of-domain languages when the number of languages in the training dataset was reduced.
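
The abstract gives no implementation details, but the training setup it describes (a pre-trained Wav2vec 2.0 encoder, CTC losses at multiple levels of abstraction, and a learned language vector as auxiliary input) can be sketched as below. This is a minimal PyTorch sketch under stated assumptions: the two-level decomposition (coarse phoneme classes plus fine IPA symbols), the embedding-based language vector, and the names and weight `aux_weight` are illustrative placeholders, not the authors' exact design.

```python
import torch
import torch.nn as nn

class HierarchicalCTCHeads(nn.Module):
    """Multi-task CTC heads on top of frame-level Wav2vec 2.0 features,
    conditioned on a learned language vector (hypothetical layout)."""

    def __init__(self, hidden_dim, n_ipa, n_classes, n_langs, lang_dim=64):
        super().__init__()
        self.lang_emb = nn.Embedding(n_langs, lang_dim)  # auxiliary language vector
        fused_dim = hidden_dim + lang_dim
        # Two levels of abstraction: coarse phoneme classes, fine IPA symbols.
        self.class_head = nn.Linear(fused_dim, n_classes + 1)  # +1: CTC blank
        self.ipa_head = nn.Linear(fused_dim, n_ipa + 1)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, features, lang_id):
        # features: (B, T, hidden_dim) outputs of a pre-trained Wav2vec 2.0
        # encoder; lang_id: (B,) integer language indices.
        lang = self.lang_emb(lang_id).unsqueeze(1).expand(-1, features.size(1), -1)
        fused = torch.cat([features, lang], dim=-1)      # condition every frame
        return (self.class_head(fused).log_softmax(-1),
                self.ipa_head(fused).log_softmax(-1))

    def loss(self, logp_class, logp_ipa, class_tgt, ipa_tgt,
             input_lens, class_lens, ipa_lens, aux_weight=0.3):
        # nn.CTCLoss expects (T, B, C) log-probabilities; the total mixes the
        # primary IPA task with the auxiliary coarse-class task.
        main = self.ctc(logp_ipa.transpose(0, 1), ipa_tgt, input_lens, ipa_lens)
        aux = self.ctc(logp_class.transpose(0, 1), class_tgt, input_lens, class_lens)
        return main + aux_weight * aux
```

The described baseline corresponds to keeping only `ipa_head` and dropping both the language embedding and the auxiliary class loss.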
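Performance is reported as Phoneme Error Rate (PER): the Levenshtein (edit) distance between reference and hypothesis phoneme sequences, normalized by the reference length. A minimal sketch, assuming the IPA transcriptions have already been tokenized into phoneme lists (the paper's exact tokenization is not specified here):

```python
def per(ref, hyp):
    """Phoneme Error Rate between reference and hypothesis phoneme lists."""
    # One-row Levenshtein DP: d[j] = distance between ref[:i] and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,                # deletion of reference phoneme
                      d[j - 1] + 1,            # insertion of hypothesis phoneme
                      prev_diag + (r != h))    # substitution or match
            prev_diag, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)

# Example: one substitution over three reference phonemes -> PER = 1/3.
assert abs(per(list("kæt"), list("kat")) - 1 / 3) < 1e-9
```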
Submission Number: 47