Abstract: Current evaluation practices in Simultaneous Speech Translation
(SimulST) systems typically involve segmenting the input audio and
corresponding translations, calculating quality and latency metrics
for each segment, and averaging the results. Although this approach
may provide a reliable estimate of translation quality, it can yield
misleading latency measurements because it implicitly assumes
that average latency values are good estimators of SimulST
systems' response time. However, our detailed analysis of latency evaluations
for state-of-the-art SimulST systems demonstrates that latency
distributions are often skewed and subject to extreme variations. As
a result, the mean latency fails to capture these anomalies,
potentially masking a lack of robustness in some systems and
metrics. In this paper, we provide a thorough analysis of the results
of systems submitted to recent editions of the IWSLT simultaneous
track to support our hypothesis, and we propose alternative ways to
report latency metrics that offer a better understanding of SimulST
systems' latency.
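
To illustrate the abstract's central claim, below is a minimal Python sketch (not from the paper; the per-segment latency values are synthetic, not real IWSLT results) showing how the mean of a skewed latency distribution can smooth away a heavy tail that distribution-aware statistics such as the median and high percentiles expose:

# Illustrative sketch only: synthetic per-segment latencies in seconds,
# mimicking a skewed distribution with a small fraction of pathological
# segments, as described in the abstract.
import numpy as np

rng = np.random.default_rng(0)
latencies = np.concatenate([
    rng.normal(2.0, 0.3, 950),   # typical segments, ~2 s
    rng.normal(12.0, 3.0, 50),   # pathological segments with extreme lag
])

print(f"mean   : {latencies.mean():.2f} s")
print(f"median : {np.median(latencies):.2f} s")
print(f"p95    : {np.percentile(latencies, 95):.2f} s")
print(f"max    : {latencies.max():.2f} s")
# The mean (about 2.5 s) understates the tail: 5% of segments lag by
# roughly 12 s, which the median together with high percentiles reveals.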
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: speech translation, automatic evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English, German, Japanese, Mandarin Chinese
Submission Number: 4145