TL;DR: Mechanisms of non-factual hallucination in language models, as well as how they manifest, evolve over pre-training, and can be applied to hallucination detection.
Abstract: State-of-the-art language models (LMs) sometimes generate non-factual hallucinations that misalign with world knowledge. Despite extensive efforts to detect and mitigate hallucinations, understanding their internal mechanisms remains elusive. Our study investigates the mechanistic causes of hallucination, especially non-factual ones where the LM incorrectly predicts object attributes in response to subject-relation queries. With causal mediation analysis and embedding space projection, we identify two mechanistic causes: 1) insufficient attribute knowledge in lower-layer MLPs, and 2) failure to select the correct object attribute in upper-layer attention heads. Hallucinations arising from these two mechanisms exhibit different degrees of subject-object association, predictive uncertainty, and robustness to perturbation. Additionally, we scrutinize LM pre-training checkpoints, revealing distinct learning dynamics for the two mechanistic causes of hallucinations. We also highlight how attribution features from our causal analysis can be used to construct effective hallucination detectors. Our work pioneers a mechanistic understanding of LM factual errors, fostering transparent and explainable approaches to hallucination mitigation.
Paper Type: long
Research Area: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, Data resources, Theory
Languages Studied: English
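For readers unfamiliar with the causal mediation analysis mentioned in the abstract, below is a minimal sketch of activation patching on GPT-2 via Hugging Face transformers. It is an illustration of the general technique, not the paper's exact setup: the prompts, the " Paris" target token, and the choice to patch only the final-position MLP output are simplifying assumptions.

```python
# Minimal sketch of causal mediation analysis (activation patching) on GPT-2,
# probing how much each layer's MLP output mediates a factual prediction.
# Prompts, target token, and patching position are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")
target_id = tok(" Paris")["input_ids"][0]  # attribute token whose logit we track

def last_token_logit(logits):
    return logits[0, -1, target_id].item()

# 1) Clean run: cache each layer's MLP output at the final token position.
cache, hooks = {}, []
for i, block in enumerate(model.transformer.h):
    def save(_, __, out, layer=i):
        cache[layer] = out[0, -1].detach().clone()
    hooks.append(block.mlp.register_forward_hook(save))
with torch.no_grad():
    clean_logit = last_token_logit(model(**clean).logits)
for h in hooks:
    h.remove()

# 2) Corrupted baseline: with the subject swapped, " Paris" should be less likely.
with torch.no_grad():
    corrupt_logit = last_token_logit(model(**corrupt).logits)

# 3) Patch the clean MLP activation into the corrupted run, one layer at a time,
#    and record how much of the clean prediction it restores (the indirect effect).
for layer in range(len(model.transformer.h)):
    def patch(_, __, out, layer=layer):
        out = out.clone()
        out[0, -1] = cache[layer]
        return out
    h = model.transformer.h[layer].mlp.register_forward_hook(patch)
    with torch.no_grad():
        patched_logit = last_token_logit(model(**corrupt).logits)
    h.remove()
    effect = (patched_logit - corrupt_logit) / (clean_logit - corrupt_logit + 1e-8)
    print(f"layer {layer:2d}: restored {effect:.2%} of the clean ' Paris' logit")
```

Layers whose patched MLP output restores a large share of the clean logit are the ones that most strongly mediate the attribute prediction; in the abstract's framing, weak mediation in lower-layer MLPs would correspond to insufficient attribute knowledge.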