Abstract: Automatic story generation is a complex NLP task whose evaluation techniques have been studied less than those for summarization or data-to-text generation. In this analysis, we focus on how relevant existing automatic metrics, both traditional and more recent, are for evaluating this kind of task. Using a dataset annotated by human evaluators, we compare automatic metrics to human judgments, look for correlations between them, and measure how well automatic metrics predict some of the human metrics. Our results mainly show that the automatic metrics are highly similar to one another and that they struggle to predict human metrics, even when combined.
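The abstract describes the analysis only at a high level; as a rough illustration, a minimal sketch of the kind of comparison it outlines (assuming standard metrics such as BLEU and BERTScore, and using synthetic placeholder scores rather than the paper's data) could look like this:

```python
# Minimal sketch (not the authors' code) of the analysis the abstract describes:
# correlate automatic metric scores with a human rating, then try to predict the
# human metric from several automatic metrics combined. All scores are synthetic.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_stories = 100

# Hypothetical per-story scores: two automatic metrics and one human rating.
bleu = rng.uniform(0, 1, n_stories)
bertscore = 0.5 * bleu + rng.normal(0, 0.2, n_stories)  # correlated with BLEU
human_coherence = rng.uniform(1, 5, n_stories)           # human annotation (1-5)

# 1) Correlation between an automatic metric and a human metric.
rho, p = spearmanr(bleu, human_coherence)
print(f"Spearman rho(BLEU, human coherence) = {rho:.3f} (p = {p:.3f})")

# 2) Predicting the human metric from combined automatic metrics.
X = np.column_stack([bleu, bertscore])
r2 = cross_val_score(LinearRegression(), X, human_coherence, cv=5, scoring="r2")
print(f"Mean cross-validated R^2 = {r2.mean():.3f}")
```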