SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation

ACL ARR 2025 May Submission 4489 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: We present SASST, an end-to-end framework for simultaneous speech translation (SimulST) that unifies segmentation, alignment, and translation within a single decoder-only language model. To decide when to emit output in real time, we introduce a syntax-aware chunking strategy that segments source speech at grammatical boundaries. The model is trained to output either translation tokens or a special <WAIT> token, enabling it to jointly learn when and what to translate under causal constraints. During training, chunk-level alignment and target-side reordering help the model associate syntactic boundaries with fluent target segments. Unlike prior systems that decouple policy and decoding, SASST integrates both into a single autoregressive generation process. Experiments on the MuST-C En→De benchmark show that SASST achieves higher BLEU at lower latency, outperforming strong baselines by up to +1.4 BLEU in low-delay regimes. These results demonstrate the effectiveness of incorporating syntactic structure into LLM-driven SimulST.
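The read/write loop implied by this design can be sketched compactly. The Python below is an illustrative reconstruction under stated assumptions, not the authors' code: the `model.next_token` and `chunker.next_chunk` interfaces and the token ids are hypothetical. The only behavior taken from the abstract is that the decoder emits either a translation token or <WAIT>, and that <WAIT> triggers reading the next syntax-aware source chunk.

```python
# Illustrative sketch of SASST-style unified decoding, as described in the
# abstract. All names (next_token, next_chunk, WAIT_ID, EOS_ID) are
# assumptions for this sketch, not the authors' API.

WAIT_ID = 1   # assumed id of the special <WAIT> token
EOS_ID = 0    # assumed end-of-sequence id


def simultaneous_translate(model, chunker, max_tokens=512):
    """Interleave reading syntax-aware source chunks with writing target
    tokens, driven entirely by the LM's own <WAIT>/token decisions.

    Assumed interfaces:
      model.next_token(context, banned_ids) -> int   (one greedy decode step)
      chunker.next_chunk() -> list[int] | None       (next grammatical chunk)
    """
    context, output = [], []
    source_done = False

    first = chunker.next_chunk()             # prime with the first chunk
    if first is None:
        return output
    context.extend(first)

    while len(output) < max_tokens:
        # Once the source is exhausted, suppress <WAIT> so the model
        # must finish writing the translation.
        banned = (WAIT_ID,) if source_done else ()
        tok = model.next_token(context, banned_ids=banned)

        if tok == WAIT_ID:                   # policy decision: READ
            chunk = chunker.next_chunk()
            if chunk is None:
                source_done = True
            else:
                context.extend(chunk)
        elif tok == EOS_ID:                  # policy decision: STOP
            break
        else:                                # policy decision: WRITE
            output.append(tok)
            context.append(tok)
    return output
```

Because both the read decision (<WAIT>) and the write decision (a target token) come out of the same autoregressive step, no separate policy network is needed; this is the unification over decoupled policy/decoding systems that the abstract claims.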
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Simultaneous Speech Translation, Speech Translation, Machine Translation, Speech Processing and Spoken Language Understanding
Languages Studied: English, German
Submission Number: 4489