Abstract: This paper benchmarks the correlation of several existing metrics with human scores in the context of Machine Translation tasks. To do so, we focus on untrained metrics, in particular the string-based BLEU, TER, and METEOR, and the embedding-based InfoLM, DepthScore, and BaryScore (which represent text as probability distributions), the latter introduced in [Colombo et al., 2021c].
The data used for this task comes from the Tenth Workshop on Statistical Machine Translation (WMT15) and consists of generated sentences paired with reference sentences, together with associated human annotations.
To assess the relevance of each of the aforementioned evaluation metrics, we rely on its closeness to the human annotations, measured with the Spearman, Pearson, and Kendall correlation coefficients. The main conclusion drawn from our numerical results is that BaryScore correlates best with human annotations. Code is available on GitHub (https://github.com/DouloSOW/project_4_text_similarity.git).
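As a rough illustration of this evaluation protocol, the sketch below computes a segment-level metric score (sentence-level BLEU via NLTK, used here only as a stand-in for the metrics above) and correlates it with human judgments using SciPy; the data and values are hypothetical placeholders, not the actual WMT15 pipeline or the authors' exact implementation.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline):
# correlate an automatic metric's scores with human judgments.
from scipy.stats import pearsonr, spearmanr, kendalltau
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical segment-level data: (hypothesis, reference, human score)
segments = [
    ("the cat sat on the mat", "the cat is sitting on the mat", 0.8),
    ("a man plays guitar", "a man is playing the guitar", 0.7),
    ("dog runs in park", "the dog is running in the park", 0.5),
]

smooth = SmoothingFunction().method1
metric_scores, human_scores = [], []
for hyp, ref, human in segments:
    # Sentence-level BLEU as an example string-based metric
    bleu = sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
    metric_scores.append(bleu)
    human_scores.append(human)

# Report the three correlation coefficients used in the paper
print("Pearson :", pearsonr(metric_scores, human_scores)[0])
print("Spearman:", spearmanr(metric_scores, human_scores).correlation)
print("Kendall :", kendalltau(metric_scores, human_scores).correlation)
```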