Abstract: This paper benchmarks the most common classical natural language generation metrics on a translation task. We evaluate the correlation between the similarity score each metric assigns to a reference translation and a candidate, and human judgments of that candidate. We then rank the metrics according to this correlation; the resulting ranking is consistent with what one would expect. Finally, we propose a way to aggregate different metrics as a vote of experts through a Kemeny consensus, in order to combine the best characteristics of each metric: strong performance on surface-level text features (BLEU and ROUGE, for instance) and on high-level semantic features (BERTScore). Alas, such an aggregation is only relevant if the metrics behave differently relative to one another across tasks, which is not the case here. Our code is available on GitHub at https://github.com/greg2451/aggregating-text-similarity-metrics. It includes a simple way to re-run our experiments on the WMT16 and WMT17 datasets, as well as code to aggregate metrics with a Kemeny consensus.
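To make the aggregation idea concrete, the sketch below shows one way a Kemeny consensus over metric rankings can be computed: each metric ranks the candidate translations, and the consensus is the permutation minimizing the total Kendall tau distance to all metric rankings. This is a minimal brute-force illustration, not the implementation from the paper's repository; the function and variable names are illustrative assumptions.

```python
# Minimal Kemeny consensus sketch (brute force; feasible only for a small
# number of candidates). Candidates are referred to by integer indices.
from itertools import permutations
from typing import List, Sequence


def kendall_tau_distance(r1: Sequence[int], r2: Sequence[int]) -> int:
    """Count candidate pairs ordered differently by the two rankings."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    items = list(r1)
    dist = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0:
                dist += 1
    return dist


def kemeny_consensus(rankings: List[Sequence[int]]) -> List[int]:
    """Return the ranking minimizing total Kendall tau distance to all voters."""
    candidates = list(rankings[0])
    best, best_cost = list(candidates), float("inf")
    for perm in permutations(candidates):
        cost = sum(kendall_tau_distance(perm, r) for r in rankings)
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best


# Example (hypothetical data): each "expert" is one metric's ranking of
# four candidate translations, best first.
bleu_rank = [0, 2, 1, 3]
rouge_rank = [0, 1, 2, 3]
bertscore_rank = [1, 0, 2, 3]
print(kemeny_consensus([bleu_rank, rouge_rank, bertscore_rank]))
```

Brute force over all permutations is exponential in the number of candidates; in practice, exact or heuristic solvers for the Kemeny problem would be used for larger candidate sets.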