TRUE: Re-evaluating Factual Consistency Evaluation

Anonymous

16 Jan 2022 (modified: 05 May 2023) · ACL ARR 2022 January Blind Submission · Readers: Everyone
Abstract: Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatically evaluating such inconsistencies may help to alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs, and annotating large-scale training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in isolation for a single task or dataset. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI- and question generation-and-answering-based approaches achieve strong and complementary results, and we recommend them as a starting point for future evaluations.
Paper Type: long
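
The example-level protocol described in the abstract amounts to scoring each (grounding text, generated text) pair with a metric and comparing the scores against binary human consistency annotations. The following is a minimal illustrative sketch of that setup, not the paper's own code: it assumes an off-the-shelf MNLI model (roberta-large-mnli, chosen purely for illustration) as the NLI-based metric and ROC AUC as the example-level quality measure.

```python
# Illustrative sketch only: score factual consistency with an off-the-shelf
# NLI model and meta-evaluate it at the example level against binary human
# labels. The model choice and the use of ROC AUC are assumptions, not the
# paper's exact experimental setup.
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # any NLI model could be swapped in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def nli_consistency_score(grounding: str, generated: str) -> float:
    """Probability that the grounding text entails the generated text."""
    inputs = tokenizer(grounding, generated, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Fall back to index 2 (standard MNLI label order) if the label map differs.
    entail_idx = model.config.label2id.get("ENTAILMENT", 2)
    return probs[entail_idx].item()

# Toy examples with binary human annotations (1 = consistent, 0 = inconsistent).
examples = [
    ("The cat sat on the mat.", "A cat was on the mat.", 1),
    ("The cat sat on the mat.", "The dog chased the cat away.", 0),
]
scores = [nli_consistency_score(grounding, generated)
          for grounding, generated, _ in examples]
labels = [label for _, _, label in examples]
print("Example-level ROC AUC:", roc_auc_score(labels, scores))
```

The sketch only shows the shape of the binarized, example-level meta-evaluation; the abstract's finding is that large-scale NLI- and QG-QA-based metrics perform strongly and complementarily under this kind of protocol.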