The HALoGen Benchmark: Fantastic LLM Hallucinations and Where To Find Them

Abhilasha Ravichander; Shrusti Ghela; David Wadden; Yejin Choi

27 Sept 2024 (modified: 15 Oct 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0

Keywords: model hallucinations, benchmark

Abstract: Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALOGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ∼150,000 generations from 14 language models, finding that even the best-performing models . We further define a novel error classification for LLM hallucinations based on their source: (1) Type A errors for errors that may stem from incorrect recollection from training data, (2) Type B errors for errors that may stem from incorrect knowledge in training data or incorrect contextualization, and (3) Type C errors for hallucinations that are likely to be fabrication. For code packages, we find that 70% of unique packages hallucinated by Llama-3-70B can be found in the C4 corpus, while for another category of hallucinations about fictional historic events, we can seldom find a basis for these events within the data. We hope that our framework will provide a foundation to enable principled scientific studies of why generative models hallucinate, and to advance the development of trustworthy large language models.
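
The decompose-then-verify idea in the abstract can be illustrated with a minimal Python sketch for the code-package domain. It is not the authors' verifier. The function names, the regex-based decomposition, and the tiny `KNOWN_PACKAGES` set standing in for a knowledge source are all assumptions made for illustration, and the score is simply the fraction of atomic units that the knowledge source does not support.

```python
import re
from typing import Callable, Iterable


def extract_imported_packages(code: str) -> list[str]:
    """Decompose generated code into atomic units: top-level imported package names.

    Hypothetical helper; only the first name of each import statement is captured,
    so `import a, b` yields just `a`.
    """
    pattern = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE)
    unique: list[str] = []
    for name in pattern.findall(code):
        if name not in unique:  # keep first-seen order, drop repeats
            unique.append(name)
    return unique


def hallucination_rate(units: Iterable[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic units that the knowledge source does not support."""
    units = list(units)
    if not units:
        return 0.0  # nothing to verify, so nothing is counted as hallucinated
    unsupported = [u for u in units if not is_supported(u)]
    return len(unsupported) / len(units)


# Toy stand-in for a knowledge source (assumption: a real verifier would
# consult a package index rather than a hard-coded set).
KNOWN_PACKAGES = {"numpy", "requests"}

# "fastjsonx" is a made-up package name used only for this example.
generation = "import numpy as np\nfrom requests import get\nimport fastjsonx\n"

units = extract_imported_packages(generation)
rate = hallucination_rate(units, lambda name: name in KNOWN_PACKAGES)
print(units, rate)
```

Keeping decomposition and verification as separate functions means the same scoring step could be reused with a different unit extractor and knowledge source for another domain.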

Primary Area: datasets and benchmarks

Submission Number: 12411
