Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

ACL ARR 2025 July Submission 418 Authors

28 Jul 2025 (modified: 08 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Simultaneous speech-to-text translation (SimulST) systems have to balance translation quality with latency--the delay between speech input and the translated output. While quality evaluation is well established, accurate latency measurement remains a challenge. Existing metrics often produce inconsistent or misleading results, especially in the widely used short-form setting, where speech is artificially presegmented. In this paper, we present the first comprehensive analysis of SimulST latency metrics across language pairs, systems, and both short- and long-form regimes. We uncover a structural bias in current metrics related to segmentation that undermines fair and meaningful comparisons. To address this, we introduce YAAL (Yet Another Average Lagging), a refined latency metric that delivers more accurate evaluations in the short-form regime. We extend YAAL to LongYAAL for unsegmented audio and propose SoftSegmenter, a novel resegmentation tool based on word-level alignment. Our experiments show that YAAL and LongYAAL outperform popular latency metrics, while SoftSegmenter enhances alignment quality in long-form evaluation, together enabling more reliable assessments of SimulST systems.
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: automatic evaluation, speech translation
Contribution Types: Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English, German, Chinese, Czech, Japanese
Previous URL: https://openreview.net/forum?id=R0SpnXd12S
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Limitations section
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Section 1
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Our artifacts are released under a fully open-source license (as specified in Section 1) and can be used for any purpose. The logs used for the analysis can be used for research purposes.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Appendix A
B6 Statistics For Data: Yes
B6 Elaboration: Appendix A
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Limitations section
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Appendix
C3 Descriptive Statistics: N/A
C3 Elaboration: We analyze the systems' outputs; therefore, multiple runs are not needed.
C4 Parameters For Packages: Yes
C4 Elaboration: Appendix
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Limitations
Author Submission Checklist: yes
Submission Number: 418