A Critical Evaluation of Evaluations for Long-form Question Answering

28 Sept 2023 · OpenReview Archive Direct Upload
Abstract: Long-form question answering (LFQA) enables answering a wide range of questions, but its flexibility poses enormous challenges for evaluation. We perform the first targeted study of the evaluation of long-form answers, covering both human and automatic evaluation practices. We hire domain experts in seven areas to provide preference judgments over pairs of answers, along with free-form justifications for their choices. We present a careful analysis of experts’ evaluation, which focuses on new aspects such as the comprehensiveness of the answer. Next, we examine automatic text generation metrics, finding that no existing metrics are predictive of human preference judgments. However, some metrics correlate with fine-grained aspects of answers (e.g., coherence). We encourage future work to move away from a single “overall score” of the answer and adopt a multi-faceted evaluation, targeting aspects such as factuality and completeness. We publicly release all of our annotations and code to spur future work into LFQA evaluation.
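To make the abstract's evaluation setup concrete, here is a minimal illustrative sketch (not the authors' released code) of how one might check whether an automatic metric predicts expert pairwise preferences and how it correlates with a fine-grained aspect such as coherence. All field names and values below are hypothetical placeholders.

```python
# Illustrative sketch: pairwise-preference accuracy and aspect correlation
# for a hypothetical automatic metric. Data values are made up.
from scipy.stats import spearmanr

# Each record: metric scores for two candidate answers to the same question,
# the expert's preferred answer ("a" or "b"), and an expert coherence rating
# for answer "a" on a 1-5 scale.
judgments = [
    {"metric_a": 0.62, "metric_b": 0.48, "preferred": "a", "coherence_a": 4},
    {"metric_a": 0.31, "metric_b": 0.55, "preferred": "a", "coherence_a": 3},
    {"metric_a": 0.74, "metric_b": 0.70, "preferred": "b", "coherence_a": 5},
    {"metric_a": 0.40, "metric_b": 0.66, "preferred": "b", "coherence_a": 2},
]

# Pairwise accuracy: how often the metric ranks the expert-preferred answer higher.
correct = sum(
    (j["metric_a"] > j["metric_b"]) == (j["preferred"] == "a") for j in judgments
)
print(f"pairwise accuracy vs. human preference: {correct / len(judgments):.2f}")

# Rank correlation between the metric and one fine-grained aspect (coherence).
rho, pvalue = spearmanr(
    [j["metric_a"] for j in judgments],
    [j["coherence_a"] for j in judgments],
)
print(f"Spearman rho with coherence ratings: {rho:.2f} (p={pvalue:.2f})")
```

A metric that is "predictive of human preference judgments" would score well on the first check; the paper's finding is that existing metrics do not, even when some correlate with individual aspects as in the second check.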