Training Simultaneous Speech Translation with Robust and Random Wait-k-Tokens Strategy

Linlin Zhang; Kai Fan; Jiajun Bu; Zhongqiang Huang

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, Venue: EMNLP 2023 Main

Submission Type: Regular Long Paper

Submission Track: Speech and Multimodality

Submission Track 2: Machine Translation

Keywords: Simultaneous Speech Translation, Robust and Random Wait-k, Cross-modal alignment

Abstract: Simultaneous Speech Translation (SimulST) is a task focused on ensuring high-quality translation of speech in low-latency situations. However, the modality gap (e.g., unknown word boundaries) between audio and text presents a challenge. This gap hinders the effective application of policies from simultaneous text translation (SimulMT) and compromises the performance of offline speech translation. To address this issue, we first use the Montreal Forced Aligner (MFA) and audio-transcription pairs to pre-train the acoustic encoder, and introduce a token-level cross-modal alignment that allows the wait-$k$ policy from SimulMT to better adapt to SimulST. This token-level boundary alignment simplifies the decision-making process for predicting read/write actions, as if the decoder were directly processing text tokens. Subsequently, to optimize the SimulST task, we propose a robust and random wait-$k$-tokens strategy. This strategy allows a single model to meet various latency requirements and minimizes error accumulation of boundary alignment during inference. Our experiments on the MuST-C dataset show that our method achieves a better trade-off between translation quality and latency.
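
For orientation, the wait-$k$ policy that the abstract borrows from SimulMT can be sketched as follows. This is only the generic token-level read/write schedule, plus uniform sampling of $k$ per training example as one plausible reading of "random wait-$k$". It is not the authors' implementation, and the function names and the sampling range are assumptions made for illustration.

```python
import random


def wait_k_actions(num_source_tokens, num_target_tokens, k):
    """Return the READ/WRITE sequence of a generic wait-k policy.

    Before emitting target token t (0-based), the policy requires
    min(k + t, num_source_tokens) source tokens to have been read.
    """
    actions = []
    read, written = 0, 0
    while written < num_target_tokens:
        needed = min(k + written, num_source_tokens)
        if read < needed:
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


def sample_k(k_min, k_max, rng):
    """Hypothetical helper: draw k uniformly from [k_min, k_max] for one
    training example, so a single model sees several latency regimes."""
    return rng.randint(k_min, k_max)


if __name__ == "__main__":
    rng = random.Random(0)
    k = sample_k(1, 5, rng)
    print(k, wait_k_actions(num_source_tokens=6, num_target_tokens=6, k=k))
```

In SimulST the "source tokens" counted here would correspond to token boundaries detected in the audio stream rather than to text tokens, which is where the token-level alignment described in the abstract comes in.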

Submission Number: 2522
