The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
Abstract: Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded content. In this paper, we investigate the internal dynamics of hallucination by examining token logit rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss: visually grounded tokens gradually become less favored as generation proceeds; (2) early excitation: semantically meaningful tokens reach peak activation in layers earlier than the final layer; and (3) hidden genuine information: visually grounded tokens, though not ultimately decoded, still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free, inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA combines two complementary approaches: reinforcing visual information in activation space and leveraging early-layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA reduces hallucination by about 40% on average on the evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies.
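To make the two ingredients named in the abstract concrete, here is a minimal conceptual sketch, not the authors' implementation: it assumes toy tensors and hypothetical names (`steer_visual`, `augment_logits`, `W_unembed`), and only illustrates the general idea of nudging hidden states toward a visual direction and blending final-layer logits with early-layer logits.

```python
import torch

# Hedged sketch of VISTA-style steering on toy tensors; all names are hypothetical.
torch.manual_seed(0)
d_model, vocab = 64, 100
W_unembed = torch.randn(d_model, vocab)  # toy unembedding matrix (hidden -> vocab logits)

def steer_visual(hidden, visual_direction, alpha=0.1):
    """Reinforce visual information in activation space by nudging the hidden
    state toward a (hypothetical) visual steering direction."""
    v = visual_direction / visual_direction.norm()
    return hidden + alpha * v

def augment_logits(final_hidden, early_hidden, beta=0.4):
    """Blend logits decoded from the final layer with logits decoded from an
    earlier layer, where semantically meaningful tokens peak ('early excitation')."""
    final_logits = final_hidden @ W_unembed
    early_logits = early_hidden @ W_unembed
    return (1 - beta) * final_logits + beta * early_logits

# Toy usage: pretend these hidden states came from an LVLM forward pass.
early_hidden = torch.randn(d_model)
final_hidden = torch.randn(d_model)
visual_dir = torch.randn(d_model)  # e.g., a direction derived from image-token activations

steered = steer_visual(final_hidden, visual_dir, alpha=0.1)
logits = augment_logits(steered, early_hidden, beta=0.4)
print("next token id:", int(logits.argmax()))
```

In an actual LVLM, the steering direction and early-layer hidden states would be read from the model's residual stream during decoding; the sketch only shows how the two signals could be combined before sampling the next token.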