Not All Who Wander Are Lost: Hallucinations as Neutral Dynamics in Residual Transformers

ICLR 2026 Conference Submission25471 Authors

20 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Transformer architectures, Mean-field Games, Hallucinations, Stability and Dynamics
Abstract: Hallucinations in autoregressive models arise in two stages: an initial deviation from the truth and its continued propagation during decoding. Existing work addresses the first stage with empirical or diagnostic methods, but there is no fundamental account of the second stage. We give the first structural analysis of how paired continuations of the same prompt evolve inside pre-LayerNorm residual transformers, which form the backbone of most modern LLMs. By examining the residual stack and decoder, we show that their dynamics contain no built-in pull that suppresses deviations and no push that amplifies them. This neutrality is necessary, but not sufficient, for semantic hallucinations: it permits deviations to persist, yet a model can still recover the correct meaning even when predictive differences remain. Neutrality also yields an explicit upper bound on deviation growth, a separation between deterministic and stochastic effects, and a statistical validation rule at finite sample sizes. A population-level version follows by treating the small deviations across many continuations as agents in a mean-field average, showing that neutrality persists at scale without requiring access to individual weights. Experiments on GPT-2 variants and Qwen2.5 models from 0.5B to 3B parameters match the theoretical predictions.
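The paired-continuation setup described in the abstract can be illustrated with a toy numerical sketch (not the authors' code; the dimensions, random linear sublayers, and deviation size are all illustrative assumptions): two hidden states that differ by a small perturbation are pushed through the same pre-LayerNorm residual stack h ← h + f(LN(h)), and the norm of their deviation is recorded at every layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_layers = 64, 12  # hypothetical width and depth

def layer_norm(x, eps=1e-5):
    # standard LayerNorm without learned scale/shift, for simplicity
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# random linear maps standing in for the attention/MLP sublayers f
weights = [rng.normal(0, 1 / np.sqrt(d), size=(d, d)) for _ in range(num_layers)]

h1 = rng.normal(size=d)
h2 = h1 + 1e-3 * rng.normal(size=d)  # paired continuation with a small deviation

deviation_norms = []
for W in weights:
    h1 = h1 + W @ layer_norm(h1)  # pre-LN residual update
    h2 = h2 + W @ layer_norm(h2)
    deviation_norms.append(float(np.linalg.norm(h1 - h2)))

# Under the paper's neutrality claim, the deviation should be neither
# systematically damped nor systematically amplified by the stack itself;
# inspect the per-layer norms directly.
print([round(n, 6) for n in deviation_norms])
```

This only sketches the measurement the analysis reasons about; the paper's actual bounds and mean-field argument apply to trained transformer weights, not to the random surrogates used here.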
Supplementary Material: zip
Primary Area: generative models
Submission Number: 25471