Keywords: summarization, factuality
Abstract: The propensity of abstractive summarization models to make factual errors has been the subject of significant study, including work on metrics to detect factual errors and annotation of errors in current systems’ outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has become increasingly difficult. In this work, we aggregate summary factuality error annotations from across nine existing datasets and stratify them according to the underlying summarization model whose outputs were annotated, in order to understand metric performance when scoring state-of-the-art versus prior models. To support finer-grained analysis, we unify error types into a single taxonomy based on the function of the erroneous word(s) and automatically project each dataset’s errors into this shared label space. We then compare five state-of-the-art factuality metrics on this benchmark. Our findings show that metrics perform significantly differently on datasets built from pretrained model outputs than on datasets built from pre-Transformer model outputs. Furthermore, no single metric is superior in all settings or for all error types, and we provide recommendations for best practices given these insights.
Paper Type: long
Research Area: Summarization