Evaluation of Untrained Metrics through Correlation with Human Judgment: the Case of Translation

Alex Tordjman, Alexandrine Lanson

19 Mar 2023, OpenReview Archive Direct Upload, Readers: Everyone

Abstract: Motivated by the recent development of Natural Language Generation (NLG), we give an overview of several untrained metrics used to evaluate the performance of NLG algorithms on a translation task. We use a dataset (WMT16, de-en) composed of pairs of sentences, each pair labelled with a human score reflecting how similar the two sentences are according to human judgment. We compute the correlation between each metric's score and the human reference score, as well as the correlations among metrics. Our results show that embedding-based metrics are more correlated with human judgment than string-based metrics, with the highest correlation coefficients obtained for BERTScore. Among metrics, embedding-based metrics are the most correlated with each other.
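
The correlation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which correlation coefficients were used, so Pearson and Spearman are assumptions here; the function names and the toy scores are hypothetical and are not taken from the paper or from WMT16.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical toy scores, one entry per sentence pair.
    human = [0.10, 0.40, 0.35, 0.80, 0.90]
    metric_scores = {
        "string_metric": [0.30, 0.20, 0.50, 0.60, 0.70],
        "embedding_metric": [0.20, 0.50, 0.30, 0.70, 0.95],
    }
    for name, scores in metric_scores.items():
        print(name, round(pearson(scores, human), 3), round(spearman(scores, human), 3))
```

The same two functions can be applied to a pair of metrics instead of a metric and the human score, which gives the correlations among metrics mentioned in the abstract.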
