Keywords: model hallucinations, benchmark
Abstract: Despite their impressive ability to generate high-quality and fluent text, generative
large language models (LLMs) also produce hallucinations: statements that are
misaligned with established world knowledge or provided input context. However,
measuring hallucination can be challenging, as having humans verify model
generations on-the-fly is both expensive and time-consuming. In this work, we release
HALOGEN, a comprehensive hallucination benchmark consisting of: (1)
10,923 prompts for generative models spanning nine domains including programming,
scientific attribution, and summarization, and (2) automatic high-precision
verifiers for each use case that decompose LLM generations into atomic units, and
verify each unit against a high-quality knowledge source. We use this framework
to evaluate ∼150,000 generations from 14 language models, finding that even the
best-performing models . We further define a novel error classification for LLM
hallucinations based on their source: (1) Type A errors for errors that may stem
from incorrect recollection from training data, (2) Type B errors for errors that
may stem from incorrect knowledge in training data or incorrect contextualization,
and (3) Type C errors for hallucinations that are likely to be fabrication. For code
packages, we find that 70% of unique packages hallucinated by Llama-3-70B can be
found in the C4 corpus, while for another category of hallucinations about fictional
historic events, we can seldom find a basis for these events within
the data. We hope that our framework will provide a foundation to enable principled
scientific studies of why generative models hallucinate, and to advance the
development of trustworthy large language models.
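
The decompose-then-verify idea described in the abstract can be illustrated with a minimal sketch for the code-package use case. This is not the HALOGEN verifier: the function names (`decompose_imports`, `hallucination_rate`), the import-line heuristic, and the tiny `known_packages` set standing in for a high-quality knowledge source are all assumptions made for illustration.

```python
from typing import Callable, List, Set


def decompose_imports(generation: str) -> List[str]:
    """Hypothetical decomposer: each imported top-level package is one atomic unit."""
    units = []
    for line in generation.splitlines():
        tokens = line.strip().split()
        # Only look at lines of the form "import x ..." or "from x import ...".
        if len(tokens) >= 2 and tokens[0] in ("import", "from"):
            units.append(tokens[1].split(".")[0].rstrip(","))
    return units


def hallucination_rate(
    generation: str,
    decompose: Callable[[str], List[str]],
    knowledge_source: Set[str],
) -> float:
    """Fraction of atomic units that the knowledge source does not support."""
    units = decompose(generation)
    if not units:
        return 0.0  # nothing to verify, so nothing is counted as hallucinated
    unsupported = [unit for unit in units if unit not in knowledge_source]
    return len(unsupported) / len(units)


# Toy stand-in for a package index; a real verifier would query an authoritative source.
known_packages = {"numpy", "requests"}

generation = "import numpy as np\nfrom requests import get\nimport fastjsonx\n"
rate = hallucination_rate(generation, decompose_imports, known_packages)
print(f"hallucinated fraction: {rate:.2f}")
```

In this toy run, one of the three imported packages (`fastjsonx`, a made-up name) is absent from the stand-in knowledge source, so it is the only unit counted as unsupported. Other domains would need their own decomposer and knowledge source, which is why both are passed in as parameters here.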
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12411