Rethinking Text Generation Evaluation: A Unified Evaluation Theory for Reflective and Open-Ended Generation Tasks

ACL ARR 2024 June Submission 3214 Authors

15 Jun 2024 (modified: 03 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: With the increased accessibility of machine-generated texts, the need to evaluate them has also grown. Evaluation is required for two types of text generation task. In open-ended generation tasks (OGTs), the model generates text de novo, without an input on which to base it; story generation is one example. In reflective generation tasks (RGTs), the model output is generated to reflect an input sequence; machine translation is one example. Evaluation of RGTs is well researched and typically uses metrics that compare the model output against one or more gold-standard references. Evaluation of OGTs is less well researched, and reference-based evaluation is more challenging: because the task does not seek to reflect an input, references are usually unavailable. In this paper, we propose a theory of evaluation that covers both RGT and OGT evaluation. Based on this theory, we propose an output-oriented reference generation method for OGTs, develop an automatic language quality evaluation method for OGTs, and review previous literature from this new perspective. Our experiments demonstrate the effectiveness of these methods across informal, formal, and domain-specific texts. A meta-evaluation comparing existing and proposed metrics shows that our approach aligns better with human judgement.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation
Contribution Types: Publicly available software and/or pre-trained models, Theory
Languages Studied: English
Submission Number: 3214