(Im)possibility of Automated Hallucination Detection in Large Language Models

Published: 01 Jul 2025, Last Modified: 01 Jul 2025
Venue: ICML 2025 R2-FM Workshop Poster
License: CC BY 4.0
Keywords: hallucinations, theory, RLHF
TL;DR: We propose a novel theoretical model for hallucination detection and show that it is generally impossible to automate this task using only positive examples; however, given negative examples as well, the task becomes much easier.
Abstract: Is automated hallucination detection fundamentally possible? In this paper, we introduce a theoretical framework to rigorously study the (im)possibility of automatically detecting hallucinations produced by large language models (LLMs). Our model builds on the classical Gold-Angluin framework of language identification and its recent adaptation by Kleinberg and Mullainathan to the language generation setting. Concretely, we investigate whether an algorithm, trained on examples from an unknown target language $K$ chosen from a countable collection of languages $\mathcal{L}$ and given access to an LLM, can reliably determine whether the LLM's outputs are correct or constitute hallucinations. First, we establish a strong equivalence between hallucination detection and the classical problem of language identification. Specifically, we prove that any algorithm capable of identifying languages (in the limit) can be efficiently transformed into one that reliably detects hallucinations, and, conversely, any successful hallucination detection strategy inherently yields language identification. Given the notorious difficulty of language identification, our first result implies that hallucination detection is impossible for most collections of languages. Second, we show that the conclusion changes dramatically once the detector's training data is enriched with both positive examples (correct statements) and negative examples (explicitly labeled incorrect statements). Under this enriched training regime, automated hallucination detection becomes possible for any countable collection $\mathcal{L}$. Our theoretical results thus underscore the fundamental importance of expert-labeled feedback in the practical deployment of hallucination detection methods, reinforcing why feedback-based approaches, such as reinforcement learning from human feedback (RLHF), have proven so crucial in improving the reliability and safety of real-world LLMs.
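To make the role of negative examples concrete, below is a minimal illustrative Python sketch (not the paper's construction) of the classical identification-by-enumeration idea and the hallucination detector it induces. The `Language` membership oracles, the `EnumerationDetector` class, and its methods are hypothetical names introduced only for this example; the sketch assumes the true language $K$ appears in the enumerated candidate list and that membership in each candidate is decidable.

```python
# Illustrative sketch only: identification by enumeration from positive AND
# negative examples, plus the hallucination detector it induces.
# Candidate languages are given as membership oracles `contains(s) -> bool`
# (a hypothetical interface introduced for this example).

from typing import Callable, List, Tuple

Language = Callable[[str], bool]  # membership oracle for one candidate language

class EnumerationDetector:
    def __init__(self, candidates: List[Language]):
        # Candidates enumerated as L_0, L_1, ...; the true language K is
        # assumed to appear somewhere in this list.
        self.candidates = candidates
        self.examples: List[Tuple[str, bool]] = []  # (string, label) pairs
        self.index = 0  # current hypothesis L_index

    def observe(self, s: str, is_positive: bool) -> None:
        """Record a labeled training example and move to the first candidate
        consistent with everything seen so far."""
        self.examples.append((s, is_positive))
        # Since K itself is consistent with all labeled examples, this loop
        # stops at or before K's index, and in the limit it settles there.
        while not self._consistent(self.candidates[self.index]):
            self.index += 1

    def _consistent(self, lang: Language) -> bool:
        return all(lang(s) == label for s, label in self.examples)

    def is_hallucination(self, llm_output: str) -> bool:
        """Flag an LLM output as a hallucination iff it falls outside the
        current hypothesis language."""
        return not self.candidates[self.index](llm_output)

# Tiny usage example with two hypothetical candidate languages.
if __name__ == "__main__":
    L0 = lambda s: True                    # over-general: accepts everything
    L1 = lambda s: s in {"paris", "rome"}  # plays the role of the target K
    det = EnumerationDetector([L0, L1])
    det.observe("paris", True)
    det.observe("atlantis", False)  # a negative example rules out L0
    print(det.is_hallucination("atlantis"))  # True
    print(det.is_hallucination("rome"))      # False
```

Note the asymmetry the sketch makes visible: with only positive examples, the consistency check above can never reject an over-general candidate (a strict superset of $K$), which is the intuition behind tying positive-example hallucination detection to the much harder problem of language identification in the limit.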
Submission Number: 148