Keywords: neural machine translation, speech translation, modality gap
TL;DR: We aim to understand the modality gap for speech translation and propose a simple yet effective Cross-modal Regularization with Scheduled Sampling (Cress) method to bridge this gap.
Abstract: How can we achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among existing techniques, multi-task learning is an effective way to share knowledge between ST and MT, so additional MT data can help learn the source-to-target mapping. However, due to the differences between speech and text, there is always a gap between ST and MT. In this paper, we first seek to understand this modality gap through target-side representation differences. We also link the modality gap to another well-known problem in neural machine translation, exposure bias: the modality gap is relatively small during training, except for some hard cases, but keeps increasing during inference due to a cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically, we regularize the output predictions of ST and MT, whose target-side contexts are derived by sampling between ground-truth words and self-generated words with a varying probability. Furthermore, to handle difficult cases with large modality gaps, we introduce token-level adaptive training, which assigns different training weights to target tokens according to the extent of the modality gap. Experiments and analysis show that our approach effectively bridges the modality gap and achieves significant improvements over a strong baseline, establishing new state-of-the-art results in all eight directions of the MuST-C dataset.
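The two ingredients described in the abstract — scheduled sampling over the target-side context, and a cross-modal regularizer between the ST and MT output distributions — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the function names, the symmetric-KL form of the regularizer, and the use of plain Python lists in place of model tensors are all assumptions made for clarity.

```python
import math
import random

def scheduled_context(gold_tokens, model_tokens, sample_prob, rng):
    """Scheduled sampling: build a target-side context by taking, at each
    position, the model's own prediction with probability `sample_prob`,
    and the ground-truth token otherwise."""
    return [m if rng.random() < sample_prob else g
            for g, m in zip(gold_tokens, model_tokens)]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions over the vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def cross_modal_regularizer(st_probs, mt_probs):
    """Cross-modal regularization (toy form): a symmetric KL between the
    ST and MT next-token distributions, averaged over target positions.
    The paper's exact loss may differ; this only conveys the idea."""
    per_position = [0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
                    for p, q in zip(st_probs, mt_probs)]
    return sum(per_position) / len(per_position)

def token_weights(st_probs, mt_probs):
    """Token-level adaptive training (toy form): weight each target token
    by how far the two modalities' distributions disagree, so hard cases
    with a large modality gap get larger training weights."""
    return [1.0 + 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
            for p, q in zip(st_probs, mt_probs)]
```

For example, with `sample_prob = 0.0` the context is the fully teacher-forced ground truth, and with `sample_prob = 1.0` it is entirely self-generated, matching the inference-time condition; the regularizer is zero exactly when ST and MT predict identical distributions.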
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)
Supplementary Material: zip
Community Implementations: [4 code implementations](https://www.catalyzex.com/paper/understanding-and-bridging-the-modality-gap/code)