ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Chenyang Le; Yao Qian; Long Zhou; Shujie LIU; Yanmin Qian; Michael Zeng; Xuedong Huang

Published: 21 Sept 2023, Last Modified: 02 Nov 2023, Venue: NeurIPS 2023 poster

Keywords: end-to-end speech to text translation, cross-modality learning, joint speech and language training

Abstract: Joint speech-language training is challenging because of its large demand for training data and GPU resources, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pre-trained speech-only and language-only models and optimized data-efficiently for spoken language tasks. In particular, we propose to incorporate cross-modality learning into transfer learning and to conduct both simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech-to-English-text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.

Supplementary Material: zip

Submission Number: 10026
