Abstract: Speech and language models are advancing towards universality. A single model can now handle translation across 200 languages and transcription for over 100 languages. Universal models simplify development and deployment and, importantly, transfer knowledge to less-resourced languages and modalities. This paper introduces M2BART, a streamlined multilingual and multimodal framework for encoder-decoder models. It employs a self-supervised speech tokenizer to bridge speech and text, and is pre-trained with a unified objective over both unimodal and multimodal data, whether unsupervised or supervised. When evaluated on Spanish-to-English and English-to-Hokkien translation, M2BART consistently surpasses competing approaches. We also showcase a novel translation model that enables zero-shot transfer even without labeled data.