Keywords: Financial Question Answering, Hallucination Detection, Mechanistic Interpretability
TL;DR: We develop a hallucination detection method based on interpretable mechanistic signals for financial question answering
Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations by using external knowledge, yet models can still produce outputs inconsistent with retrieved evidence, a critical issue in financial QA. We find that hallucinations often arise when later-layer feedforward networks (FFNs) over-inject parametric knowledge into the residual stream. To address this, we introduce external context scores and parametric knowledge scores, mechanistic features extracted from Qwen3-0.6B across layers and attention heads. Using these signals, lightweight classifiers achieve strong detection performance and generalize to GPT-4.1-mini responses, demonstrating the promise of proxy-model evaluation for financial tasks. Mechanistic signals thus offer efficient, generalizable predictors for hallucination detection in RAG. Code and data: https://github.com/pegasi-ai/InterpDetect
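To illustrate the pipeline shape the abstract describes, below is a minimal, hypothetical sketch: forward hooks capture per-layer FFN (MLP) contributions to the residual stream of Qwen3-0.6B, those contributions are pooled into per-example features, and a lightweight classifier is fit on labeled responses. The paper's actual external context and parametric knowledge scores are not specified here; the norm-based features, the prompt template, and the `features` helper are stand-in assumptions, not the authors' method (see the repository for the real implementation).

```python
# Hypothetical sketch of a hook-based feature extractor + lightweight classifier.
# Assumed: HF model id "Qwen/Qwen3-0.6B"; norm-based features as a proxy for the
# paper's external context / parametric knowledge scores.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_id = "Qwen/Qwen3-0.6B"  # proxy model named in the abstract
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

ffn_outputs = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Store the MLP block's additive contribution to the residual stream.
        ffn_outputs[layer_idx] = output.detach()
    return hook

handles = [
    layer.mlp.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.model.layers)
]

def features(context: str, question: str, answer: str):
    """One feature per layer: mean L2 norm of the FFN contribution over all tokens.
    (Stand-in for the paper's mechanistic scores.)"""
    text = f"Context: {context}\nQuestion: {question}\nAnswer: {answer}"
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        model(**enc)
    return torch.stack(
        [ffn_outputs[i].norm(dim=-1).mean() for i in range(len(handles))]
    ).numpy()

# X: stacked feature vectors for labeled (context, question, answer) triples;
# y: 1 if the answer is hallucinated w.r.t. the retrieved context, else 0.
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```

Because the features come from a small proxy model rather than the generator, the same extractor could in principle score responses produced by a larger model (as the abstract reports for GPT-4.1-mini), with only the classifier needing labeled data.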
Submission Number: 17