Abstract: Speech and language models are advancing towards universality. A single model can now handle translation across 200 languages and transcription for over 100 languages. Universal models simplify development and deployment and, importantly, transfer knowledge to less-resourced languages and modalities. This paper introduces M2BART, a streamlined multilingual and multimodal framework for encoder-decoder models. It employs a self-supervised speech tokenizer to bridge speech and text, and is pre-trained with a unified objective over both unimodal and multimodal data, whether unsupervised or supervised. When evaluated on Spanish-to-English and English-to-Hokkien translation, M2BART consistently surpasses competing approaches. We also showcase a novel translation model that enables zero-shot transfer even without labeled data.