Keywords: natural language generation, evaluation, survey, metrics
TL;DR: This paper presents a survey of current evaluation practices in NLG and finds that authors frequently use metrics whose validity has been questioned and generally do not justify their choice of metrics.
Abstract: Automatic metrics are extensively used to evaluate natural language processing systems.
However, how these metrics are used and reported is coming under increasing scrutiny.
This work presents a survey of the use of automatic metrics, focusing on natural language generation (NLG) tasks. We report which metrics are used, the rationale for choosing them, and how their use is reported. Our findings reveal significant shortcomings, including inappropriate metric usage, a lack of implementation details, and missing correlations with human judgments. We conclude with recommendations that we believe authors should follow to enable greater rigor within the field.
Submission Number: 64