FASST: Fast LLM-based Simultaneous Speech Translation

ACL ARR 2024 June Submission5165 Authors

16 Jun 2024 (modified: 12 Aug 2024) · CC BY 4.0
Abstract: Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. Existing methods either incur high latency by recomputing input representations, or fall behind offline ST in translation quality. In this paper, we propose FASST, a fast large language model (LLM)-based method for streaming speech translation. We introduce blockwise-causal speech encoding and a consistency mask, so that streaming speech input can be encoded incrementally without recomputation. Furthermore, we develop a two-stage training strategy to optimize FASST for simultaneous inference. We evaluate FASST and multiple strong prior models on the MuST-C dataset. Experimental results show that FASST achieves the best quality-latency tradeoff, outperforming the previous best model by an average of 1.5 BLEU at the same latency for English-to-Spanish translation.
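The abstract does not include code, but the core idea of blockwise-causal encoding can be illustrated with a minimal NumPy sketch: speech frames are grouped into blocks, and each frame may attend only to frames in its own block or earlier blocks. This makes previously computed representations reusable as new blocks arrive, so nothing needs recomputing. The function name, block size, and mask convention here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def blockwise_causal_mask(num_frames: int, block_size: int) -> np.ndarray:
    """Illustrative blockwise-causal attention mask (not the paper's code).

    Entry (i, j) is True iff frame i is allowed to attend to frame j,
    i.e. frame j lies in the same block as i or in an earlier block.
    """
    # Block index of each frame, e.g. block_size=2 -> [0, 0, 1, 1, 2, ...]
    blocks = np.arange(num_frames) // block_size
    # Broadcast comparison: allow j whenever block(j) <= block(i).
    return blocks[None, :] <= blocks[:, None]

mask = blockwise_causal_mask(num_frames=6, block_size=2)
# Frames 0-1 form block 0, frames 2-3 block 1, frames 4-5 block 2.
# Frame 0 can attend to frame 1 (same block) but not to frame 2 (future block).
```

Because the mask for the first k blocks is a fixed prefix of the mask for k+1 blocks, encoder states for earlier blocks stay valid when a new block of speech arrives, which is what allows incremental encoding without recomputation.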
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: efficient inference for MT, speech translation
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English, Spanish, German
Submission Number: 5165