Abstract: Retrieval-Augmented Generation (RAG) aims to mitigate hallucinations by incorporating external knowledge sources.
However, the seemingly accurate, authoritative responses of RAG models may unintentionally make hallucinations harder to detect. In this paper, we systematically investigate this phenomenon across three popular RAG frameworks and three question-answering datasets. Compared to vanilla language models, RAG increases the false negative rate of widely adopted automatic hallucination detectors from 23.8% to 52.0% on average. Furthermore, we study the impact of RAG in a production model (DeepSeek-R1) on real human users, and find that RAG raises the false negative rate of hallucination detection by 5.4%. Finally, we show that optimizing RAG models with hallucination detectors does not mitigate this problem but exacerbates it: RAG models can hack hallucination detectors and further increase the false negative rate by 53.3%. We highlight an overlooked risk of RAG and call for more research on helping both machines and humans detect hallucinations.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: fact checking, rumor/misinformation detection, retrieval-augmented generation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Keywords: retrieval-augmented generation, hallucination detection, large language model
Submission Number: 5813