Abstract: The rapid adoption of language models (LMs) across diverse applications has raised concerns about their factuality, i.e., their consistency with real-world facts. We introduce VERIFY, an evidence-based evaluation pipeline that measures LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on Web-retrieved evidence. Importantly, factuality judgments by VERIFY correlate more strongly with human evaluations than existing methods. Using VERIFY, we identify "hallucination prompts," i.e., those that frequently elicit factual errors in LM responses. These prompts form FACTBENCH, a dataset of 1K prompts spanning 150 fine-grained topics and tiered by difficulty. We benchmark widely used open-weight and proprietary LMs from six families, yielding three key findings: (i) factual precision declines as prompt difficulty increases from Easy to Hard, (ii) factuality does not necessarily improve with scale, as Llama3.1-405B-Instruct performs comparably to or worse than its 70B variant, and (iii) Gemini1.5-Pro shows a notably higher refusal rate, with over-refusal in 25% of cases.
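
The abstract's verdict taxonomy (supported / unsupported / undecidable, judged against Web-retrieved evidence) can be pictured with a minimal sketch. This is not the VERIFY implementation: the `ContentUnit`, `judge_unit`, and `factual_precision` names, the entailment/contradiction callables, and the precision definition below are illustrative assumptions standing in for the pipeline's actual LM-based judge.

```python
# Illustrative sketch only: mirrors the three verdict categories named in the
# abstract; the judging and scoring logic is a hypothetical stand-in, not VERIFY.
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    UNDECIDABLE = "undecidable"


@dataclass
class ContentUnit:
    text: str            # one verifiable statement extracted from an LM response
    evidence: list[str]   # Web-retrieved snippets gathered for this unit


def judge_unit(unit: ContentUnit, entails, contradicts) -> Verdict:
    """Assign a verdict to one content unit.

    `entails(snippet, claim)` and `contradicts(snippet, claim)` are placeholder
    callables for whatever judge (e.g., an LM prompt) decides whether a snippet
    supports or refutes the statement.
    """
    if not unit.evidence:
        return Verdict.UNDECIDABLE
    if any(entails(snippet, unit.text) for snippet in unit.evidence):
        return Verdict.SUPPORTED
    if any(contradicts(snippet, unit.text) for snippet in unit.evidence):
        return Verdict.UNSUPPORTED
    return Verdict.UNDECIDABLE


def factual_precision(verdicts: list[Verdict]) -> float:
    """One plausible precision metric: supported fraction of decidable units."""
    decidable = [v for v in verdicts if v is not Verdict.UNDECIDABLE]
    if not decidable:
        return 0.0
    return sum(v is Verdict.SUPPORTED for v in decidable) / len(decidable)
```

Under this sketch, a response is scored by extracting its content units, retrieving evidence per unit, labeling each unit, and aggregating; units without usable evidence stay undecidable rather than counting against the model.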
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Factuality Evaluation, Fact Checking, Benchmark Curation, Model Evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 8059