A Two-Stage LoRA Strategy for Expanding Language Capabilities in Multilingual ASR Models

Chin Yuen Kwok, Hexin Liu, Jia Qi Yip, Sheng Li, Eng Siong Chng

Published: 01 Jan 2025 · Last Modified: 23 Jan 2026 · IEEE Transactions on Audio, Speech and Language Processing · CC BY-SA 4.0
Abstract: Adapting multilingual automatic speech recognition (MASR) models to support new languages is crucial for enhancing global communication accessibility. However, extending these models often leads to catastrophic forgetting, impairing their ability to accurately process previously learned languages. To address this, we present a mixture-of-experts (MoE) strategy that employs Low-Rank Adaptation (LoRA) experts, each dedicated to a specific language. Unlike previous approaches that rely on a small neural network for gating, we utilize the language identification (LID) outputs of the MASR model itself to more precisely activate the appropriate LoRA experts. Our method involves training two sets of LoRAs, one for LID and one for ASR: the LID LoRAs adapt the MASR model to identify both existing and new languages, while the ASR LoRAs comprise multiple language-specific LoRA experts for recognition. During inference, we first employ the LID LoRAs to identify the language, then activate the corresponding ASR LoRA experts based on this identification. A drawback of this design is that LID errors also cause the wrong ASR LoRA experts to be activated. We therefore perform language-wise beam search to allow self-correction of such mistakes. When applied to the Whisper model and integrated with ten new languages from the Common Voice dataset, our approach achieves up to 13.8% and 10.4% relative improvements in word error rate (WER) for new and previously learned languages, respectively. This method effectively mitigates catastrophic forgetting with less than 3% additional computation overhead compared to the standard LoRA implementation.
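The two-stage routing described in the abstract can be sketched in a few lines. This is a minimal illustrative mock, not the authors' implementation: all names (`run_lid`, `ASR_LORA_EXPERTS`, `transcribe`) are hypothetical, and the LoRA experts are stand-in tags rather than actual low-rank adapter weights attached to a Whisper model.

```python
# Hypothetical sketch of the two-stage LID-gated LoRA routing from the abstract.
# Stage 1 (LID LoRAs) predicts the language; Stage 2 activates the matching
# language-specific ASR LoRA expert. Names and data shapes are assumptions.

# One LoRA "expert" per language; each value is a tag standing in for a set
# of low-rank adapter weights that would be merged into the base MASR model.
ASR_LORA_EXPERTS = {
    "en": "lora_asr_en",
    "yue": "lora_asr_yue",
    "sw": "lora_asr_sw",
}

def run_lid(audio, lid_lora="lora_lid_all"):
    """Stage 1: the LID LoRAs adapt the MASR model to identify the language
    across both existing and new languages. Stubbed here: we pretend the
    audio record carries its true language label."""
    return audio["lang"]

def transcribe(audio):
    """Stage 2: activate the ASR LoRA expert selected by the LID output.
    Note that an LID error here would activate the wrong expert, which is
    why the paper adds language-wise beam search for self-correction."""
    lang = run_lid(audio)
    expert = ASR_LORA_EXPERTS[lang]
    return f"<{expert}> transcript of {audio['id']}"

print(transcribe({"id": "utt1", "lang": "yue"}))
```

In a real system, activating an expert would mean loading or merging that language's LoRA weights into the frozen base model before decoding; the gating itself adds almost no compute, consistent with the paper's reported sub-3% overhead.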