FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

Farima Fatahi Bayat; Lechen Zhang; Sheza Munir; Lu Wang

FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

Farima Fatahi Bayat, Lechen Zhang, Sheza Munir, Lu Wang

28 Sept 2024 (modified: 16 Feb 2025)ICLR 2025 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Factuality Evaluation Benchmark, Factuality Evaluation Techniques, LLM Evaluation

TL;DR: We curate a benchmark from in-the-wild user-model interactions that evaluates language models' factuality in diverse scenarios.

Abstract: Language models (LMs) are widely used by an increasing number of users, underscoring the challenge of maintaining factual accuracy across a broad range of topics. We present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs’ factual accuracy in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on the retrieved web evidence. Importantly, VERIFY’s factuality judgments correlate better with human evaluations than existing methods. Using VERIFY, we identify “hallucination prompts” across diverse topics–those eliciting the highest rates of incorrect or unverifiable LM responses. These prompts form FACTBENCH, a dataset of 985 prompts across 213 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and is regularly updated with new prompts. We benchmark widely-used LMs from GPT, Gemini, and Llama3.1 family on FACTBENCH, yielding the following key findings: (i) Proprietary models exhibit better factuality, improving from Hard to Easy hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual accuracy than Llama3.1-70B-Instruct across all evaluation methods due to its higher subjectivity that leads to more undecidable content. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases.

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 13591

Loading