Abstract: Long-Form Question Answering (LFQA) involves generating in-depth, paragraph-level responses to open-ended questions, and its free-form output makes evaluation particularly challenging. Previous benchmarks for LFQA evaluation lack references and are limited in size and topic coverage, which reduces their reliability. To address this gap, we propose a carefully constructed, multilingual, and reference-based benchmark named LFQA-E, aiming to rigorously assess the performance of automatic evaluation metrics for LFQA. LFQA-E consists of 1625 questions and 7649 comparisons covering 15 topics, and is drawn from diverse sources, including online questions and examination questions, to test the comprehensive ability of evaluation metrics. Using LFQA-E, we evaluate 5 types of evaluation metrics, comprising 15 specific metrics in total. The results reveal that none of the current automatic evaluation metrics performs comparably to humans, indicating that they cannot adequately capture the dense information contained in long-form responses. In addition, we provide a detailed analysis of why automatic evaluation metrics fail when evaluating LFQA, as well as of the generalization ability of these metrics.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: LFQA, Benchmark, Evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 5089