Do Language Models Know When They're Hallucinating References?

NeurIPS 2023 Workshop ICBINB, Submission 36 Authors

Published: 27 Oct 2023, Last Modified: 01 Dec 2023
Keywords: Large Language Models, Hallucinations, Investigative Interviewing
TL;DR: Large Language models can be detectives, investigating their own generated references by asking questions (in new sessions) about who the authors are.
Abstract: State-of-the-art language models (LMs) are famous for "hallucinating" references. These fabricated article and book titles cause harm, obstruct the models' use, and provoke public backlash. While other types of LM hallucinations are also important, we propose hallucinated references as the "drosophila" of research on hallucination in large language models (LLMs), as they are particularly easy to study. We show that simple search engine queries reliably identify such hallucinations, which facilitates evaluation. To begin to dissect the nature of hallucinated LM references, we attempt to classify them using black-box queries to the same LM, without consulting any external resources. Consistency checks done with _direct_ queries about whether the generated reference title is real (inspired by Kadavath et al. (2022), Lin et al. (2022) and Manakul (2023)) are compared to consistency checks with _indirect_ queries which ask for ancillary details, such as the authors of the work. These consistency checks are found to be partially reliable indicators of whether or not the reference is a hallucination. In particular, we find that LMs often hallucinate _differing_ authors of hallucinated references when queried in independent sessions, while _consistently_ identifying the authors of real references. This suggests that the hallucination may be more an artifact of generation than inherent to current training techniques or representations.
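The indirect-query consistency check described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_lm` is a hypothetical caller-supplied function that asks the LM, in a fresh session, for the authors of a given title and returns a list of names; the overlap metric and session count are assumptions for illustration.

```python
from itertools import combinations


def author_overlap(a, b):
    """Jaccard overlap between two author-name lists (case-insensitive)."""
    sa, sb = {name.lower() for name in a}, {name.lower() for name in b}
    if not (sa or sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)


def indirect_consistency(title, query_lm, n_sessions=3):
    """Ask for the authors of `title` in several independent sessions and
    return the mean pairwise overlap of the answers.

    High scores (answers agree) suggest a real reference; low scores
    (the LM names different authors each time) suggest a hallucination.
    """
    answers = [query_lm(title) for _ in range(n_sessions)]
    pairs = list(combinations(answers, 2))
    return sum(author_overlap(a, b) for a, b in pairs) / len(pairs)
```

With a real LM behind `query_lm`, the score would be thresholded to flag likely hallucinations; the direct-query variant would instead ask "Is this reference real?" and measure agreement of the yes/no answers across sessions.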
Submission Number: 36