Evaluating Open-QA Evaluation

Published: 26 Sept 2023, Last Modified: 15 Jan 2024
NeurIPS 2023 Datasets and Benchmarks Poster
Keywords: Open-QA, Large Language Models, QA Evaluation
TL;DR: This paper examines the reliability of current evaluation methods for Open-QA, introduces a new task, QA-Eval, and presents a dataset that aids in developing better automatic evaluators.
Abstract: This study focuses on the evaluation of the Open Question Answering (Open-QA) task, which can directly estimate the factuality of large language models (LLMs). Current automatic evaluation methods have shown limitations, indicating that human evaluation still remains the most reliable approach. We introduce a new task, QA Evaluation (QA-Eval), and the corresponding dataset EVOUNA, designed to assess the accuracy of AI-generated answers relative to gold-standard answers within Open-QA. We measure the performance of these evaluation methods against human-annotated results; methods that show high correlation with human evaluations are deemed more reliable. We also discuss the pitfalls of current methods and ways to improve LLM-based evaluators. We believe this new QA-Eval task and the corresponding dataset EVOUNA will facilitate the development of more effective automatic evaluation tools and prove valuable for future research in this area. All resources are available at https://github.com/wangcunxiang/QA-Eval and are released under the Apache-2.0 License.
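As a rough illustration of the comparison described in the abstract (not taken from the paper; the field names and data are hypothetical), the following minimal Python sketch computes the agreement rate between an automatic evaluator's correctness labels and human annotations:

    # Minimal sketch: how well does an automatic evaluator agree with humans?
    # Each record pairs a human label with an automatic evaluator's label
    # (1 = answer judged correct, 0 = incorrect); field names are illustrative.
    from typing import Dict, List

    def evaluator_agreement(records: List[Dict[str, int]]) -> float:
        """Fraction of examples where the automatic evaluator matches the human label."""
        matches = sum(r["human_label"] == r["auto_label"] for r in records)
        return matches / len(records)

    if __name__ == "__main__":
        sample = [
            {"human_label": 1, "auto_label": 1},
            {"human_label": 0, "auto_label": 1},  # false positive by the evaluator
            {"human_label": 1, "auto_label": 1},
            {"human_label": 0, "auto_label": 0},
        ]
        print(f"Agreement with human annotation: {evaluator_agreement(sample):.2f}")  # 0.75

An evaluator with a higher agreement (or correlation) with human judgments would, under this framing, be considered the more reliable one.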
Supplementary Material: pdf
Submission Number: 495