LCMA-SRT: Language-Conditional Mixture-of-Experts Adapters for Joint Multilingual Speech Recognition and Translation
Keywords: ASR, AST, MoE, CR-CTC, Multilingual
Abstract: Neural transducers provide an alignment-free framework for joint automatic speech recognition (ASR) and speech translation (ST). Hierarchical transducer architectures further improve multilingual speech-to-text modeling by stacking a translation-focused encoder on top of an ASR encoder to better handle reordering. However, scaling hierarchical transducers to multilingual many-to-many settings remains challenging: fully shared models often suffer from negative transfer and unstable target-language generation, while training separate models per direction is computationally prohibitive. We propose LCMA-SRT (Language-Conditional Mixture-of-Experts Adapters for Speech Recognition and Translation), which augments a hierarchical transducer with language-conditional Mixture-of-Experts (MoE) adapters. A source-conditioned MoE adapter (SRC-MoE) routes using the source-language embedding to improve acoustic–phonetic modeling and reduce cross-language interference for ASR. A target-conditioned MoE adapter (TGT-MoE) routes using the desired target language to guide reordering and lexical selection and to mitigate cross-target interference in many-to-many ST. Experiments on Europarl-ST (9 languages, 72 directions) show that LCMA-SRT improves both ASR and ST within a single joint model, reducing average WER and increasing BLEU and COMET over strong hierarchical transducer baselines. We release our code and models at \url{https://anonymous.4open.science/r/LCMA-SRT}.
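To make the language-conditional routing concrete, here is a minimal, illustrative PyTorch sketch of a MoE adapter that gates a small set of bottleneck experts with a language embedding, as the abstract describes for SRC-MoE (routed by source language) and TGT-MoE (routed by target language). The class name, layer sizes, and gating details are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class LanguageConditionalMoEAdapter(nn.Module):
    """Illustrative language-conditional MoE adapter (hypothetical sketch).

    A language embedding (source language for a SRC-MoE-style adapter,
    target language for a TGT-MoE-style adapter) produces routing weights
    over a small set of bottleneck adapter experts; the gated mixture of
    expert outputs is added back to the encoder states as a residual.
    """

    def __init__(self, d_model: int, num_langs: int,
                 num_experts: int = 4, bottleneck: int = 64):
        super().__init__()
        self.lang_emb = nn.Embedding(num_langs, d_model)
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, bottleneck),
                nn.ReLU(),
                nn.Linear(bottleneck, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, hidden: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, d_model); lang_id: (batch,)
        lang = self.lang_emb(lang_id)                      # (batch, d_model)
        gates = torch.softmax(self.router(lang), dim=-1)   # (batch, num_experts)
        # Run every expert and weight its output by the language-conditioned gate.
        expert_out = torch.stack([e(hidden) for e in self.experts], dim=-1)
        mixed = (expert_out * gates[:, None, None, :]).sum(dim=-1)
        return hidden + mixed


# Example usage: route by source language inside the ASR encoder, and by
# target language inside the translation-focused encoder.
adapter = LanguageConditionalMoEAdapter(d_model=256, num_langs=9)
frames = torch.randn(2, 50, 256)          # (batch, time, d_model)
lang_ids = torch.tensor([0, 3])           # per-utterance language indices
out = adapter(frames, lang_ids)           # (2, 50, 256)
```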
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: automatic speech recognition, spoken language translation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, French, German, Italian, Spanish, Portuguese, Polish, Romanian, Dutch
Submission Number: 3016