Visible to everyone since 04 Oct 2024 | License: CC BY 4.0
Language models (LMs) are used by a growing number of users, underscoring the challenge of maintaining factual accuracy across a broad range of topics. We present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs’ factual accuracy in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on the retrieved web evidence. Importantly, VERIFY’s factuality judgments correlate better with human evaluations than those of existing methods. Using VERIFY, we identify “hallucination prompts” across diverse topics, namely those eliciting the highest rates of incorrect or unverifiable LM responses. These prompts form FACTBENCH, a dataset of 985 prompts across 213 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and is regularly updated with new prompts. We benchmark widely used LMs from the GPT, Gemini, and Llama3.1 families on FACTBENCH, yielding the following key findings: (i) Proprietary models exhibit better factuality, with factuality improving from Hard to Easy hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual accuracy than Llama3.1-70B-Instruct across all evaluation methods, because its higher subjectivity leads to more undecidable content. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases.
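To make the idea of selecting “hallucination prompts” concrete, the following is a minimal Python sketch and not the VERIFY implementation. It assumes each response has already been split into content units labeled supported, unsupported, or undecidable. The function names `label_rates` and `rank_prompts`, and the choice to rank by the combined share of unsupported and undecidable units, are illustrative assumptions; the abstract does not specify the actual scoring formula.

```python
from collections import Counter

# The three labels named in the abstract.
LABELS = ("supported", "unsupported", "undecidable")


def label_rates(unit_labels):
    """Return the fraction of content units carrying each label."""
    counts = Counter(unit_labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(unit_labels)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


def rank_prompts(labels_by_prompt):
    """Order prompts from most to least problematic.

    Assumption: "problematic" is the share of unsupported plus
    undecidable units. This is a stand-in, not the paper's metric.
    """
    def problem_rate(prompt):
        rates = label_rates(labels_by_prompt[prompt])
        return rates["unsupported"] + rates["undecidable"]

    return sorted(labels_by_prompt, key=problem_rate, reverse=True)
```

Under these assumptions, prompts at the head of the ranking would be the candidates for a benchmark of the kind described; a real pipeline would also need the retrieval and judgment steps that produce the labels.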