InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation

Published: 23 Sept 2025, Last Modified: 17 Feb 2026 · CogInterp @ NeurIPS 2025 · Reject · CC BY 4.0
Keywords: Mechanistic Interpretability, External Knowledge vs. Parametric Knowledge, RAG Hallucination Detection
TL;DR: We develop a RAG hallucination detection approach by disentangling the contributions of parametric knowledge and external context
Abstract: Retrieval-Augmented Generation (RAG) reduces hallucinations by incorporating external knowledge, yet models still produce content inconsistent with retrieved evidence. Effective detection requires separating external context from parametric knowledge, which prior methods conflate. We find that hallucinations often emerge when later-layer FFNs inject excessive parametric knowledge into the residual stream. To address this, we explore \textit{external context scores} and \textit{parametric knowledge scores}, mechanistic features derived from Qwen3-0.6B across layers and attention heads. Using these signals, we train lightweight classifiers and evaluate them against LLMs and detection baselines. Furthermore, classifiers trained on Qwen3-0.6B signals generalize to GPT-4.1-mini responses, demonstrating the potential of \texttt{proxy-model evaluation}. Our results show that mechanistic signals offer efficient, generalizable predictors for hallucination detection in RAG. \footnote{Code and data: \url{https://anonymous.4open.science/r/interpretability-hallucination-experiments-75D3}.}
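To make the two feature families concrete, here is a minimal sketch of how such signals might be extracted, assuming access to a model's attention weights and per-layer FFN outputs (e.g. via hooks). The function names, the attention-mass definition of the external context score, and the relative-norm definition of the parametric knowledge score are illustrative assumptions, not the paper's exact formulas; synthetic tensors stand in for real model activations.

```python
import numpy as np

def external_context_scores(attn, context_mask):
    """Hypothetical external context score.

    attn: (layers, heads, q_len, k_len) attention weights (rows sum to 1).
    context_mask: (k_len,) bool, True at retrieved-context token positions.
    Returns (layers, heads): mean attention mass placed on context tokens,
    one score per attention head.
    """
    mass = attn[..., context_mask].sum(axis=-1)  # (layers, heads, q_len)
    return mass.mean(axis=-1)                    # average over query positions

def parametric_knowledge_scores(ffn_out, resid):
    """Hypothetical parametric knowledge score.

    ffn_out, resid: (layers, seq, d_model) FFN outputs and residual stream.
    Returns (layers,): how strongly each layer's FFN writes into the
    residual stream, as a relative norm.
    """
    num = np.linalg.norm(ffn_out, axis=-1).mean(axis=-1)
    den = np.linalg.norm(resid, axis=-1).mean(axis=-1) + 1e-8
    return num / den

# Synthetic activations standing in for a small proxy model's internals.
rng = np.random.default_rng(0)
L, H, T, D = 4, 8, 16, 32
attn = rng.random((L, H, T, T))
attn /= attn.sum(axis=-1, keepdims=True)         # normalize to attention rows
context_mask = np.zeros(T, dtype=bool)
context_mask[:8] = True                          # first 8 tokens = retrieved context

ecs = external_context_scores(attn, context_mask)            # (4, 8)
pks = parametric_knowledge_scores(rng.standard_normal((L, T, D)),
                                  rng.standard_normal((L, T, D)))  # (4,)

# Per-response feature vector for a lightweight downstream classifier
# (e.g. logistic regression over these mechanistic features).
features = np.concatenate([ecs.ravel(), pks])
```

A classifier trained on such feature vectors from one model can, in principle, be applied to score responses generated by another, which is the proxy-model evaluation idea the abstract describes.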
Submission Number: 38