Towards Zero-shot Learning for End-to-end Cross-modal Translation Models

Jichen Yang; Kai Fan; Minpeng Liao; Boxing Chen; Zhongqiang Huang

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Findings

Submission Type: Regular Short Paper

Submission Track: Speech and Multimodality

Submission Track 2: Machine Translation

Keywords: Zero-Shot, End-to-End, Speech Translation

Abstract: One of the main problems in speech translation is the mismatch between modalities. A second problem is the scarcity of parallel data covering multiple modalities, which means that end-to-end multi-modal models tend to perform worse than cascade models, although there are exceptions under favorable conditions. To address these problems, we propose an end-to-end zero-shot speech translation model that connects two pre-trained uni-modal modules via Word Rotator's Distance. Like a cascade model, it retains zero-shot capability, yet it can also be trained end-to-end to avoid error propagation. Comprehensive experiments on the MuST-C benchmarks show that our end-to-end zero-shot approach performs as well as or better than CTC-based cascade models, and that our end-to-end model with supervised training also matches the latest baselines.
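The abstract names Word Rotator's Distance (WRD) as the link between the two pre-trained modules. As a rough illustration only, the sketch below computes WRD in its usual formulation (Yokoi et al., 2020): each token's transport mass is proportional to its vector norm, and the ground cost is the cosine distance between token directions. Assumptions: the function name `word_rotators_distance` and its parameters are hypothetical, the exact optimal-transport problem is approximated with entropy-regularized Sinkhorn iterations (a swapped-in solver, not necessarily what the authors use), and the inputs are generic token-level vectors standing in for the paper's actual speech-side and text-side representations.

```python
import numpy as np


def word_rotators_distance(x: np.ndarray, y: np.ndarray,
                           eps: float = 0.1, n_iter: int = 200) -> float:
    """Hypothetical helper: entropy-regularized approximation of WRD.

    x: (n, d) token vectors of one sequence (e.g. speech-side states).
    y: (m, d) token vectors of the other sequence (e.g. text-side states).
    eps: Sinkhorn regularization strength (assumed value, not from the paper).
    """
    norm_x = np.linalg.norm(x, axis=1)
    norm_y = np.linalg.norm(y, axis=1)
    # Transport mass of each token is proportional to its vector norm.
    a = norm_x / norm_x.sum()
    b = norm_y / norm_y.sum()
    # Ground cost: cosine distance between unit-normalized token vectors.
    ux = x / (norm_x[:, None] + 1e-12)
    uy = y / (norm_y[:, None] + 1e-12)
    cost = 1.0 - ux @ uy.T
    # Sinkhorn iterations on the Gibbs kernel alternately fit the row
    # and column marginals of the transport plan.
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    # WRD is the total cost of moving mass along the transport plan.
    return float(np.sum(plan * cost))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech_states = rng.normal(size=(6, 8))  # made-up stand-in data
    text_states = rng.normal(size=(4, 8))    # made-up stand-in data
    print(word_rotators_distance(speech_states, text_states))
```

Because WRD compares two sequences of different lengths through a soft alignment rather than position by position, a distance of this kind can, in principle, pull speech-side representations toward text-side ones without requiring the two sequences to have equal length.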

Submission Number: 1701
