Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: Current pre-trained models for summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is therefore necessary for developing better models. However, the optimal human evaluation setup for factual consistency has not been standardized. To address this issue, we crowdsourced evaluations of factual consistency using the rating-based Likert scale and the ranking-based Best-Worst Scaling to determine the factors that affect the reliability of human evaluation. Our crowdsourced evaluations are conducted on summaries of the CNN-Daily Mail and XSum datasets generated by four state-of-the-art models. We find that ranking-based Best-Worst Scaling offers a more reliable measure of summary quality across datasets, whereas the reliability of Likert ratings depends strongly on the target dataset and the evaluation design. To improve reliability, we extend the Likert rating scale to make it more flexible, and we present a scoring algorithm for Best-Worst Scaling, called value learning. Our crowdsourcing guidelines and evaluation protocols will be publicly available to facilitate future research on factual consistency in summarization.
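
The value-learning scorer introduced in the paper is not described in the abstract; as a point of reference, the conventional way to turn Best-Worst Scaling annotations into per-summary scores is the counting method (best-count minus worst-count, normalized by the number of appearances). The following is a minimal sketch of that standard counting score, assuming a simple annotation format; the data layout, item names, and function name are illustrative assumptions, not the paper's implementation.

    from collections import defaultdict

    def bws_counting_scores(annotations):
        """Standard Best-Worst Scaling counting scores (not the paper's value-learning scorer).

        `annotations` is a list of dicts, each describing one annotated tuple, e.g.
            {"items": ["A", "B", "C", "D"], "best": "A", "worst": "C"}
        Returns a dict mapping each item to (#best - #worst) / #appearances,
        a score in [-1, 1].
        """
        best = defaultdict(int)
        worst = defaultdict(int)
        seen = defaultdict(int)
        for ann in annotations:
            for item in ann["items"]:
                seen[item] += 1
            best[ann["best"]] += 1
            worst[ann["worst"]] += 1
        return {item: (best[item] - worst[item]) / seen[item] for item in seen}

    # Example: two annotated 4-tuples over summaries from hypothetical systems A-D.
    annotations = [
        {"items": ["A", "B", "C", "D"], "best": "A", "worst": "C"},
        {"items": ["A", "B", "C", "D"], "best": "B", "worst": "C"},
    ]
    print(bws_counting_scores(annotations))
    # {'A': 0.5, 'B': 0.5, 'C': -1.0, 'D': 0.0}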