Abstract: The rapid adoption of language models (LMs) across diverse applications has raised concerns about their factuality, i.e., their consistency with real-world facts. We introduce VERIFY, an evidence-based evaluation pipeline that measures LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on Web-retrieved evidence. Importantly, factuality judgments by VERIFY correlate more strongly with human evaluations than existing methods. Using VERIFY, we identify "hallucination prompts," i.e., those that frequently elicit factual errors in LM responses. These prompts form FACTBENCH, a dataset of 1K prompts spanning 150 fine-grained topics and tiered by difficulty. We benchmark widely used open-weight and proprietary LMs from six families, yielding three key findings: (i) factual precision declines as prompt difficulty increases from Easy to Hard, (ii) factuality does not necessarily improve with scale, as Llama3.1-405B-Instruct performs comparably to or worse than its 70B variant, and (iii) Gemini1.5-Pro shows a notably higher refusal rate, with over-refusal in 25% of cases.
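
The abstract's verdict taxonomy (supported / unsupported / undecidable, judged against Web-retrieved evidence) can be pictured with a minimal sketch. This is not the VERIFY implementation: the `ContentUnit`, `judge_unit`, and `factual_precision` names, the entailment/contradiction callables, and the precision definition below are illustrative assumptions standing in for the pipeline's actual LM-based judge.

```python
# Illustrative sketch only: mirrors the three verdict categories named in the
# abstract; the judging and scoring logic is a hypothetical stand-in, not VERIFY.
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    UNDECIDABLE = "undecidable"


@dataclass
class ContentUnit:
    text: str            # one verifiable statement extracted from an LM response
    evidence: list[str]   # Web-retrieved snippets gathered for this unit


def judge_unit(unit: ContentUnit, entails, contradicts) -> Verdict:
    """Assign a verdict to one content unit.

    `entails(snippet, claim)` and `contradicts(snippet, claim)` are placeholder
    callables for whatever judge (e.g., an LM prompt) decides whether a snippet
    supports or refutes the statement.
    """
    if not unit.evidence:
        return Verdict.UNDECIDABLE
    if any(entails(snippet, unit.text) for snippet in unit.evidence):
        return Verdict.SUPPORTED
    if any(contradicts(snippet, unit.text) for snippet in unit.evidence):
        return Verdict.UNSUPPORTED
    return Verdict.UNDECIDABLE


def factual_precision(verdicts: list[Verdict]) -> float:
    """One plausible precision metric: supported fraction of decidable units."""
    decidable = [v for v in verdicts if v is not Verdict.UNDECIDABLE]
    if not decidable:
        return 0.0
    return sum(v is Verdict.SUPPORTED for v in decidable) / len(decidable)
```

Under this sketch, a response is scored by extracting its content units, retrieving evidence per unit, labeling each unit, and aggregating; units without usable evidence stay undecidable rather than counting against the model.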
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Factuality Evaluation, Fact Checking, Benchmark Curation, Model Evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 8059