Abstract: Multilingual speech recognition systems often take an input language code to prompt transcription in the target language. However, the language spoken in the input audio may not always match the language code, as is common in multilingual societies. This language mismatch can significantly reduce ASR quality. We present a technique to identify and mitigate this issue. We combine off-the-shelf language-ID and language verification models to determine the language code input to the ASR model. The language verification model acts as a gate that decides whether to trust the provided language code or to use the output of the language-ID model. We compare these approaches with baselines that include vanilla language-ID-based and language-independent ASR models. Our experiments on the YouTube, SPRING-INX, and FLEURS datasets show the efficacy of the proposed model, especially in the mismatched language code setting.
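The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `choose_language_code`, the `verify` and `identify` callables, and the `threshold` value of 0.5 are all hypothetical, and the abstract does not say how the verification score is thresholded. The sketch only assumes that a verifier returns a score for the provided code and that a language-ID model returns a predicted code.

```python
from typing import Callable


def choose_language_code(
    audio: bytes,
    provided_code: str,
    verify: Callable[[bytes, str], float],  # hypothetical: score that audio matches the code
    identify: Callable[[bytes], str],       # hypothetical: language-ID prediction
    threshold: float = 0.5,                 # assumed value, not taken from the paper
) -> str:
    """Return the language code to pass to the ASR model."""
    # Gate: keep the provided code when the verifier accepts it.
    if verify(audio, provided_code) >= threshold:
        return provided_code
    # Otherwise fall back to the language-ID model's prediction.
    return identify(audio)
```

With stub models, a verifier score above the threshold keeps the provided code, while a low score makes the function return the language-ID prediction instead.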
External IDs:dblp:conf/icassp/KimMABFCRG25