Abstract: The correlation between automatic evaluation metrics for natural language generation (NLG) and human evaluation is the most critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and choices of correlation coefficient yield at least 12 distinct correlation measures, and their characteristics have long been poorly understood. This paper therefore illustrates the relationships between these correlation measures and demonstrates, through statistical simulations, how the degree of data discretization affects their values. In addition, we design algorithms to evaluate the discriminative power and ranking consistency of the 12 correlation measures, using empirical data from 6 datasets and 32 evaluation metrics, and report the resulting findings.
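To make concrete how a grouping method changes the reported correlation, the following is a minimal sketch, not the paper's implementation. The abstract does not enumerate the grouping methods or coefficients, so the three groupings below (`global_level`, `system_level`, `input_level`), the hand-written `pearson` helper, and the toy score matrices are all illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (hand-written helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Toy data (made up for illustration): rows are systems, columns are inputs.
metric = [[0.2, 0.5, 0.9], [0.3, 0.6, 0.7], [0.6, 0.8, 1.0]]
human = [[1, 3, 4], [2, 3, 3], [3, 4, 5]]


def global_level(m, h, corr=pearson):
    """Assumed grouping: pool every (system, input) score pair into one sample."""
    flat_m = [v for row in m for v in row]
    flat_h = [v for row in h for v in row]
    return corr(flat_m, flat_h)


def system_level(m, h, corr=pearson):
    """Assumed grouping: average over inputs per system, then correlate the means."""
    mean_m = [sum(row) / len(row) for row in m]
    mean_h = [sum(row) / len(row) for row in h]
    return corr(mean_m, mean_h)


def input_level(m, h, corr=pearson):
    """Assumed grouping: correlate across systems per input, then average."""
    per_input = [corr(list(cm), list(ch)) for cm, ch in zip(zip(*m), zip(*h))]
    return sum(per_input) / len(per_input)


results = {
    "global": global_level(metric, human),
    "system": system_level(metric, human),
    "input": input_level(metric, human),
}
```

Passing a rank-based coefficient as `corr` in place of `pearson` would give further measures on the same data, which is how a handful of groupings and coefficients multiply into many distinct correlation measures.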
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: statistical testing for evaluation, metrics, evaluation methodologies, evaluation
Contribution Types: Data analysis
Languages Studied: English
Submission Number: 4950