Abstract: Retrieval-Augmented Generation (RAG) aims to mitigate hallucinations by incorporating external knowledge sources.
However, the seemingly accurate, authoritative responses of RAG models may unintentionally make hallucinations harder to detect. In this paper, we systematically investigate this phenomenon across three popular RAG frameworks and three question-answering datasets. Compared to vanilla language models, RAG increases the false negative rate of widely adopted automatic hallucination detectors from 23.8% to 52.0% on average. Furthermore, we study the impact of RAG in a production model (DeepSeek-R1) on real human users, and find that RAG raises the false negative rate of hallucination detection by 5.4%. Finally, we show that optimizing RAG models with hallucination detectors does not mitigate this problem but exacerbates it: RAG models can hack hallucination detectors and further increase the false negative rate by 53.3%. We highlight an overlooked risk of RAG and call for more research on helping both machines and humans detect hallucinations.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: fact checking, rumor/misinformation detection, retrieval-augmented generation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Keywords: retrieval-augmented generation, hallucination detection, large language model
Submission Number: 5813