UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models

ACL ARR 2024 June Submission213 Authors

07 Jun 2024 (modified: 08 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Large language models (LLMs) may generate text that is inconsistent with human knowledge, leading to factual inaccuracies or hallucinations. Existing research on evaluating the factuality of LLMs typically extracts fact claims with an LLM and verifies them against a predefined fact source. However, these evaluation metrics are task-specific and not scalable, and the substitutability of fact sources across different tasks is under-explored. To address these challenges, we categorize four available fact sources: human-written evidence, reference documents, search engine results, and LLM knowledge, along with five text generation tasks spanning six representative datasets. We then propose UFO, an LLM-based unified and flexible evaluation framework that verifies facts against plug-and-play fact sources, and we implement six evaluation scenarios based on this framework. Experimental results show that human-written evidence and reference documents are crucial in most QA tasks, whereas in news fact generation tasks, introducing human-written evidence reduces the discriminative power of the evaluation. Compared to LLM knowledge, search engine results are more important in most tasks, but they are less effective in the expert-validated QA task.
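To make the "plug-and-play fact sources" idea concrete, below is a minimal, hypothetical Python sketch of the claim-extraction and verification loop the abstract describes. It is not the authors' UFO implementation: the names (FactSource, extract_claims, verify, factuality_score) are illustrative assumptions, and the naive sentence splitting and substring matching stand in for the LLM-based extraction and entailment judgments the paper actually uses.

```python
# Hypothetical sketch of fact verification against a swappable fact source.
# Not the UFO codebase; all names and logic here are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Claim:
    text: str


# A "fact source" is modeled as a retrieval function: claim -> evidence passages.
# Human-written evidence, reference docs, search results, or LLM knowledge
# can each be plugged in behind this one interface.
FactSource = Callable[[Claim], List[str]]


def extract_claims(answer: str) -> List[Claim]:
    """Placeholder for LLM-based claim extraction (naive sentence split)."""
    return [Claim(s.strip()) for s in answer.split(".") if s.strip()]


def verify(claim: Claim, evidence: List[str]) -> bool:
    """Placeholder verification: substring match stands in for an
    LLM-based entailment judgment."""
    return any(claim.text.lower() in passage.lower() for passage in evidence)


def factuality_score(answer: str, source: FactSource) -> float:
    """Fraction of extracted claims supported by the plugged-in fact source."""
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    supported = sum(verify(c, source(c)) for c in claims)
    return supported / len(claims)


# Usage: swap fact sources without changing the evaluation pipeline.
reference_docs = ["Paris is the capital of France."]
doc_source: FactSource = lambda claim: reference_docs
print(factuality_score("Paris is the capital of France.", doc_source))  # 1.0
```

The design point the sketch illustrates is that the scoring pipeline is fixed while the fact source is a parameter, which is what lets the same framework cover the six evaluation scenarios.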
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation of datasets, evaluation methodologies, metrics
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 213