Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

ACL ARR 2025 May Submission 4002 Authors

19 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Simultaneous speech-to-text translation (SimulST) systems must balance translation quality with latency, the delay between speech input and translated output. While quality evaluation is well established, accurately measuring latency remains challenging. Existing metrics often produce inconsistent or misleading results, especially in the widely used short-form setting where speech is artificially pre-segmented. In this paper, we present the first comprehensive analysis of SimulST latency metrics across language pairs, systems, and both short- and long-form regimes. We uncover a segmentation-related structural bias in current metrics that undermines fair and meaningful comparisons. To address this, we introduce YAAL (Yet Another Average Lagging), a refined latency metric that delivers more consistent and accurate evaluations in the short-form regime. We extend YAAL to LongYAAL for unsegmented audio streams and propose SoftSegmenter, a novel resegmentation tool based on word-level alignment. Our experiments show that YAAL and LongYAAL outperform popular latency metrics, while SoftSegmenter enhances alignment quality in long-form evaluation, together enabling more reliable assessments of SimulST systems.
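For context on the metric family YAAL refines: the abstract's name "Yet Another Average Lagging" points back to the standard Average Lagging (AL) metric of Ma et al. (2019), whose speech-input form compares each target token's emission time against an oracle that translates at a uniform rate. Below is a minimal sketch of that standard AL computation; it illustrates the baseline only, since the abstract does not specify YAAL's refinements, and the function name and argument choices are this sketch's own.

```python
def average_lagging(delays, src_duration_ms, tgt_len=None):
    """Standard Average Lagging for speech input (a sketch, not YAAL).

    delays          -- emission time (ms from audio start) of each target token
    src_duration_ms -- total duration of the source audio in ms
    tgt_len         -- target length for the oracle rate; defaults to len(delays)
    """
    if tgt_len is None:
        tgt_len = len(delays)
    # An ideal (oracle) system emits one token every src_duration_ms / tgt_len ms.
    rate = src_duration_ms / tgt_len
    # tau: 1-based index of the first token emitted at or after the audio ends;
    # tokens after that point are excluded so tail latency is not double-counted.
    tau = next((i + 1 for i, d in enumerate(delays) if d >= src_duration_ms),
               len(delays))
    # Average gap between actual emission times and the oracle schedule.
    return sum(d - i * rate for i, d in enumerate(delays[:tau])) / tau
```

For example, a system that emits three tokens at 1.0 s, 2.0 s, and 3.0 s for a 3-second utterance lags the uniform oracle by exactly one second per token, giving an AL of 1000 ms.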
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: automatic evaluation, speech translation
Contribution Types: Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English, German, Chinese, Czech, Japanese
Submission Number: 4002