How Close are Automated Metrics to Human Judgment in Machine Translation?

21 Mar 2023 · OpenReview Archive Direct Upload
Abstract: Machine Translation (MT) has progressed rapidly in recent years, driven by advances in neural models and the availability of large parallel corpora. However, developing automated metrics that reliably evaluate MT systems remains a significant obstacle: widely used Natural Language Processing (NLP) metrics do not always align with human judgment, which can lead to inaccurate evaluations. To address this issue, we benchmark a range of automated NLP metrics at the sentence level under two settings: candidate-to-reference comparison and candidate-to-original-sentence comparison, also known as the Quality Estimation (QE) task. We find that automated metrics perform well in the former setting, but leave significant room for improvement in the latter. Our results highlight the importance of multilingual QE, which offers a strategic way around the difficulty of collecting labelled data for every language pair and can thereby play a crucial role in improving MT models. At the same time, our findings underscore the need for further research, particularly on automated metrics that align more closely with human judgment. Improving the accuracy and reliability of automated NLP metrics will be essential to advancing MT and realizing the full potential of machine translation technology.
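As an illustration of the candidate-to-reference setting described above (a minimal sketch, not the paper's actual pipeline), the snippet below scores candidate translations against references with sentence-level BLEU and correlates the scores with human judgments. The use of sacrebleu and scipy, as well as all data shown, are illustrative assumptions.

```python
# Minimal sketch: sentence-level metric vs. human judgment correlation.
# Assumes sacrebleu and scipy are installed; all data below is hypothetical.
import sacrebleu
from scipy.stats import kendalltau

# Hypothetical candidate translations, references, and human adequacy scores.
candidates = [
    "the cat sits on the mat",
    "a dog runs in park",
    "she reads book every evening",
]
references = [
    "the cat is sitting on the mat",
    "a dog is running in the park",
    "she reads a book every evening",
]
human_scores = [0.9, 0.5, 0.7]  # e.g., normalized direct-assessment scores

# Candidate-to-reference setting: score each sentence against its reference.
metric_scores = [
    sacrebleu.sentence_bleu(cand, [ref]).score
    for cand, ref in zip(candidates, references)
]

# Kendall's tau is a common choice for sentence-level meta-evaluation.
tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```

In the QE (candidate-to-original-sentence) setting, the reference would be replaced by the source sentence and a reference-free model would produce the metric scores; the correlation step stays the same.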