Direct Segmentation Models for Streaming Speech Translation

Jorge Civera, Alfons Juan-Císcar, Javier Iranzo-Sánchez

26 Jan 2024 | OpenReview Archive Direct Upload | Readers: Everyone

Abstract: The cascade approach to Speech Translation (ST) is based on a pipeline that chains an Automatic Speech Recognition (ASR) system with a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual but also acoustic information to decide when the ASR output is split into a chunk. An extensive experimental evaluation is carried out on the Europarl-ST dataset to demonstrate the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work.
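To make the role of the segmenter concrete, the following is a minimal sketch of a streaming segmenter sitting between ASR and MT. It is not the model proposed in the paper: the `Token` layout, the `split_score` function, its weights, and the `threshold` and `max_len` parameters are all hypothetical, chosen only to illustrate how a textual cue and an acoustic cue (here, pause length) might be combined into a split decision under a latency bound.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Token:
    """One ASR output word plus a simple acoustic cue (hypothetical layout)."""
    text: str
    pause_after: float  # seconds of silence following the word


def split_score(token: Token, pause_weight: float = 2.0) -> float:
    """Toy stand-in for a learned segmentation model (illustrative weights).

    Adds a textual cue (sentence-final punctuation) to an acoustic cue
    (pause length scaled by pause_weight).
    """
    textual = 1.0 if token.text.endswith((".", "?", "!")) else 0.0
    acoustic = pause_weight * token.pause_after
    return textual + acoustic


def segment_stream(
    tokens: Iterable[Token],
    threshold: float = 1.0,
    max_len: int = 20,
    score: Callable[[Token], float] = split_score,
) -> Iterator[List[str]]:
    """Yield chunks of words as soon as a split is decided.

    A chunk is closed when the score reaches the threshold, or when it
    holds max_len words, which bounds how long the MT system must wait.
    """
    chunk: List[str] = []
    for token in tokens:
        chunk.append(token.text)
        if score(token) >= threshold or len(chunk) >= max_len:
            yield chunk
            chunk = []
    if chunk:  # flush whatever remains when the stream ends
        yield chunk


if __name__ == "__main__":
    stream = [
        Token("hello", 0.0), Token("world", 0.6),
        Token("this", 0.0), Token("is", 0.0), Token("fine.", 0.0),
        Token("ok", 0.0),
    ]
    for chunk in segment_stream(stream):
        print(" ".join(chunk))
```

In this sketch each chunk would be handed to the MT system as soon as it is yielded. The `max_len` cap is one simple way to trade segmentation quality against latency; a learned model would replace `split_score` entirely.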
