FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees
TL;DR: We introduce FactTest, a framework that statistically evaluates whether an LLM can generate correct answers to given questions, with provable correctness guarantees.
Abstract: The propensity of large language models (LLMs) to generate hallucinations and non-factual content undermines their reliability in high-stakes domains, where rigorous control over Type I errors (the conditional probability of incorrectly classifying hallucinations as truthful content) is essential. Despite its importance, formal verification of LLM factuality with such guarantees remains largely unexplored.
In this paper, we introduce FactTest, a novel framework that statistically assesses whether an LLM can provide correct answers to given questions with high-probability correctness guarantees. We formulate hallucination detection as a hypothesis testing problem to enforce an upper bound on Type I errors at user-specified significance levels. Notably, we prove that FactTest also ensures strong Type II error control under mild conditions and can be extended to remain effective under covariate shift. Our approach is distribution-free and works for any number of human-annotated samples. It is model-agnostic and applies to any black-box or white-box LLM. Extensive experiments on question-answering (QA) benchmarks demonstrate that FactTest effectively detects hallucinations and enables LLMs to abstain from answering unknown questions, yielding an accuracy improvement of over 40%.
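To make the flavor of the finite-sample, distribution-free guarantee concrete, the sketch below calibrates an answer/abstain threshold from a finite set of annotated hallucinations using a conformal-style order statistic. This is a minimal illustrative sketch, not the paper's exact algorithm: the function names, the choice of confidence score, and the specific decision rule are assumptions made for illustration.

```python
import numpy as np

def calibrate_threshold(halluc_scores, alpha):
    """Illustrative conformal-style calibration (not the exact FactTest procedure).

    halluc_scores: confidence scores assigned by the model to calibration
                   questions it answered incorrectly (i.e., hallucinations).
    alpha:         user-specified significance level for the Type I error
                   (declaring a hallucination "truthful").

    Returns a threshold tau such that, for an exchangeable new hallucination,
    P(score > tau) <= alpha, with no distributional assumptions and for any
    finite calibration size n.
    """
    scores = np.sort(np.asarray(halluc_scores))
    n = len(scores)
    # Rank of the order statistic needed so that exceeding it has
    # probability at most alpha under exchangeability.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # calibration set too small: always abstain
    return scores[k - 1]

def answer_or_abstain(score, tau):
    """Answer only when confidence strictly exceeds the calibrated threshold."""
    return "answer" if score > tau else "abstain"

# Hypothetical usage with a placeholder confidence score per question:
# tau = calibrate_threshold(calibration_hallucination_scores, alpha=0.05)
# decision = answer_or_abstain(model_confidence(question), tau)
```

The threshold comes from the order statistic at rank ⌈(n+1)(1−α)⌉, which is what delivers the finite-sample bound on the Type I error regardless of the score distribution; the full framework in the paper additionally establishes Type II error control and handles covariate shift.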
Lay Summary: Large Language Models (LLMs), like ChatGPT, are powerful tools capable of generating remarkably human-like text—but they have a troubling habit of presenting false or fabricated information, a phenomenon known as "hallucinations." Imagine a medical diagnosis or legal advice given with absolute confidence but entirely incorrect; the risks are enormous. To combat this problem, we created FactTest, a robust statistical tool that checks whether the answers from language models are trustworthy or not. FactTest operates by statistically evaluating the correctness of a model’s responses and enables the model to "politely refuse" to answer when it's uncertain, thereby dramatically reducing false statements. Unlike previous solutions, FactTest offers clear-cut mathematical guarantees to keep mistakes under tight control. Our method boosts the accuracy of responses by more than 40% over typical models, adapts gracefully when encountering new or different kinds of questions, and even enhances the trustworthiness of commercial "black-box" models, whose internal workings are inaccessible.
Link To Code: https://github.com/fannie1208/FactTest
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Uncertainty Quantification, Hallucination Detection, Neyman-Pearson Classification, Conformal Prediction
Submission Number: 5102