Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices

Patrícia Schmidtová; Saad Mahamood; Simone Balloccu; Ondrej Dusek; Albert Gatt; Dimitra Gkatzia; David M Howcroft; Ondrej Platek; Adarsa Sivaprasad

Published: 06 Oct 2024, Last Modified: 12 Nov 2024

Venue: WiNLP 2024

License: CC BY 4.0

Keywords: natural language generation, evaluation, survey, metrics

TL;DR: This paper presents a survey of current evaluation practices in NLG and finds that authors frequently use metrics whose validity is being questioned and generally do not comment on the metric choice.

Abstract: Automatic metrics are extensively used to evaluate natural language processing systems. However, there is an increasing focus on how they are used and reported. This work presents a survey on the use of automatic metrics, focusing on natural language generation (NLG) tasks. We report which metrics are used, the rationale for choosing them, and how their use is reported. Our findings reveal significant shortcomings, including inappropriate metric usage, lack of implementation details, and missing correlations with human judgments. We conclude with recommendations that we believe authors should follow to enable more rigor within the field.

Submission Number: 64
