Evaluating Text Generation: Comparative Analysis and Entropy-Weighted BLEU

22 Mar 2023 · OpenReview Archive Direct Upload
Abstract: Automated Story Generation (ASG) is an active area of Natural Language Processing (NLP) that requires reliable evaluation methods. In this article, we examine techniques and metrics for evaluating the quality of automatically generated text, including pretrained language models and associated metrics such as BLEU, ROUGE, and BERTScore. We analyze the performance of several Automatic Evaluation Metrics (AEMs) on the MANS and HANNA datasets, across a range of text generators and against human judgments. We introduce a new variant of the BLEU metric, Entropy-Weighted BLEU, which is particularly useful for shorter texts but has limitations. Our study highlights the importance of selecting appropriate metrics for evaluating text generation quality and emphasizes the need for continued exploration of new evaluation methods. Our code is available on GitHub.
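The abstract names Entropy-Weighted BLEU but does not specify the weighting scheme here. A minimal sketch of one plausible construction, assuming each n-gram's contribution to modified precision is weighted by the average information content (negative log unigram probability, estimated from the reference) of its tokens, so rare, informative words count more than frequent ones; the function name `entropy_weighted_bleu` and all details below are illustrative, not the authors' definition:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def entropy_weights(reference_tokens):
    """Per-token information content from the reference's unigram distribution."""
    counts = Counter(reference_tokens)
    total = sum(counts.values())
    return {tok: -math.log(c / total) for tok, c in counts.items()}

def entropy_weighted_bleu(candidate, reference, max_n=4):
    """BLEU-style score where n-gram matches are weighted by token entropy (sketch)."""
    cand, ref = candidate.split(), reference.split()
    weights = entropy_weights(ref)
    # Tokens unseen in the reference get the maximum observed weight.
    default = max(weights.values()) if weights else 1.0

    def gram_weight(gram):
        return sum(weights.get(t, default) for t in gram) / len(gram)

    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        num = den = 0.0
        for gram, c in cand_counts.items():
            gw = gram_weight(gram)
            num += min(c, ref_counts.get(gram, 0)) * gw  # clipped, weighted match
            den += c * gw
        precisions.append(num / den if den else 0.0)

    if not cand or min(precisions) == 0:
        return 0.0
    # Standard BLEU brevity penalty and geometric mean of precisions.
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

When every reference token is equally frequent the weights are uniform and the score reduces to ordinary sentence-level BLEU, which is one sanity check for a scheme like this.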