Keywords: LLM hallucination, Time-Varying, Causal Interpretability
Abstract: Hallucinations in Large Language Models (LLMs) have emerged as a critical bottleneck for their practical application, causing misjudgments, amplifying bias, polluting information, and eroding trust.
Prior studies on faithfulness hallucinations mainly focus on contextual disconnection and question-answer mismatch hallucinations. The former concerns the internal inconsistency of the generated sequence, manifested as a logical contradiction or semantic break with the previous output. The latter concerns the external inconsistency between the model's answer and the user's intent, causing the answer to deviate from the question. However, current methods lack causal awareness and overlook the dynamic evolution of hallucinations. To address these drawbacks, we introduce a Causal Analytical Framework for Time-Varying Dynamics of Hallucinations in LLMs. Specifically, we first identify the unbiased causal effects of the prefix and the question on the currently generated sequence. Next, we propose a Time-Varying Causal Hallucination Index System to measure the contextual disconnection hallucination and the question-answer mismatch hallucination. Overall, our work has the following highlights:
(1) Causal tracing. We identify causal pathways and provide interpretable tracing of the root causes of hallucinations. (2) Precise and dynamic quantification. The framework characterizes the spatio-temporal dynamics of autoregressive hallucination generation, providing quantitative support for analysis and risk monitoring. (3) Reference-free. Our indexes monitor hallucinations without ground-truth answers, enabling unified measurement in no-ground-truth settings.
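To make the time-varying, reference-free idea concrete, below is a minimal illustrative sketch (not the paper's actual index system): it ablates the question from the context and measures the per-token log-probability drop over a generated answer as a rough proxy for how strongly each generation step causally depends on the question. The model choice ("gpt2"), the helper stepwise_logprobs, and the example question/answer pair are assumptions introduced purely for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def stepwise_logprobs(context_ids, answer_ids):
    # Log-probability of each answer token under teacher forcing,
    # given a (possibly ablated) context prepended to the answer.
    ids = torch.cat([context_ids, answer_ids]).unsqueeze(0)
    logp = model(ids).logits.log_softmax(-1)[0]
    start = context_ids.shape[-1]
    # logits at position i predict the token at position i + 1
    return torch.stack([logp[start + t - 1, answer_ids[t]]
                        for t in range(answer_ids.shape[-1])])

question = "What year did the Apollo 11 mission land on the Moon?"
answer = " Apollo 11 landed on the Moon in 1969."
q_ids = tok(question, return_tensors="pt").input_ids[0]
a_ids = tok(answer, return_tensors="pt").input_ids[0]
bos = torch.tensor([tok.bos_token_id])

full = stepwise_logprobs(q_ids, a_ids)   # answer conditioned on the question
ablated = stepwise_logprobs(bos, a_ids)  # question ablated (BOS-only context)

# Rough per-step "question reliance" curve: large values suggest the step is
# grounded in the question; values near zero suggest the question has little
# influence at that step. A symmetric ablation of the already-generated prefix
# would give the analogous contextual-disconnection curve.
reliance = full - ablated
for t, (tid, r) in enumerate(zip(a_ids.tolist(), reliance.tolist())):
    print(f"step {t:2d} {tok.decode([tid])!r:>12} reliance {r:+.2f}")
```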
Primary Area: interpretability and explainable AI
Submission Number: 3242