Long-Form End-to-End Speech Translation via Latent Alignment Segmentation

Peter Polák, Ondrej Bojar

Published: 2024, Last Modified: 20 May 2025. Venue: SLT 2024. License: CC BY-SA 4.0

Abstract: Contemporary datasets provide an oracle segmentation into sentences based on human-annotated transcripts and translations. However, this sentence segmentation is not available in the real world, and current speech segmentation approaches offer poor segmentation quality in the low-latency regime. This paper proposes a novel segmentation approach for low-latency end-to-end speech translation. We leverage an existing speech translation encoder-decoder architecture with speech translation CTC (ST CTC) and show that the latent alignments produced by ST CTC can guide the segmentation. To the best of our knowledge, our method is the first to enable true long-form end-to-end simultaneous speech translation, as a single neural model translates and segments at the same time. On a diverse set of language pairs and on both in-domain and out-of-domain data, we show that the proposed approach outperforms current state-of-the-art segmentation methods at no additional computational cost.
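As a rough illustration of how a CTC alignment could guide segmentation, the sketch below collapses per-frame CTC argmax labels into tokens with their first frame index and proposes a segment boundary wherever a sentence-final token is emitted. This is an assumption made for illustration, not the algorithm from the paper: the blank id, the token ids, and the helper names `ctc_greedy_alignment` and `segment_boundaries` are all hypothetical, and the abstract does not specify how boundaries are actually derived from the latent alignments.

```python
from typing import Iterable, List, Set, Tuple

BLANK = 0  # assumed CTC blank id (hypothetical; depends on the vocabulary)


def ctc_greedy_alignment(
    frame_ids: Iterable[int], blank: int = BLANK
) -> List[Tuple[int, int]]:
    """Collapse per-frame argmax labels into (token_id, first_frame) pairs.

    Standard CTC collapsing: consecutive repeats are merged and blanks
    are dropped; a repeat separated by a blank counts as a new token.
    """
    aligned: List[Tuple[int, int]] = []
    prev = blank
    for t, tok in enumerate(frame_ids):
        if tok != blank and tok != prev:
            aligned.append((tok, t))
        prev = tok
    return aligned


def segment_boundaries(
    frame_ids: Iterable[int],
    sentence_final_ids: Set[int],
    blank: int = BLANK,
) -> List[int]:
    """Return frame indices where a sentence-final token is first emitted.

    Assumption: treating these frames as candidate segment boundaries is
    one plausible reading of "alignments guide the segmentation".
    """
    return [
        t
        for tok, t in ctc_greedy_alignment(frame_ids, blank)
        if tok in sentence_final_ids
    ]


if __name__ == "__main__":
    # Toy per-frame labels; id 9 stands in for a sentence-final token.
    frames = [0, 5, 5, 0, 7, 9, 9, 0, 0, 3, 9]
    print(ctc_greedy_alignment(frames))
    print(segment_boundaries(frames, {9}))
```

Because the alignment is read off the CTC output that the translation model already computes, a scheme of this kind would add no separate segmentation model, which is in line with the "no additional computational cost" claim; the real boundary criterion may differ.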