The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models Via Visual Information Steering

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded content. In this paper, we investigate the internal dynamics of hallucination by examining the logit rankings of tokens throughout the generation process, revealing three key patterns in how LVLMs process information: (1) *gradual visual information loss* -- visually grounded tokens gradually become less favored throughout generation; (2) *early excitation* -- semantically meaningful tokens achieve peak activation at layers earlier than the final layer; and (3) *hidden genuine information* -- visually grounded tokens that are not ultimately decoded still retain relatively high rankings at inference. Based on these insights, we propose **VISTA** (**V**isual **I**nformation **S**teering with **T**oken-logit **A**ugmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA combines two complementary approaches: reinforcing visual information in activation space and leveraging early-layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA reduces hallucination by about 40% on average on the evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies. Code is available at https://github.com/LzVv123456/VISTA.
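A minimal sketch of the two components described above, assuming a PyTorch-style decoder. The function names, the visual direction vector, and the mixing weights `alpha`/`beta` are illustrative placeholders, not the released implementation; see the repository for the actual code.

```python
# Illustrative sketch (not the authors' implementation) of the two ideas:
# (1) steering hidden activations toward visual information, and
# (2) augmenting final-layer logits with early-layer logits.
import torch


def visual_steering(hidden_state, visual_direction, alpha=0.1):
    """Nudge the hidden state toward a direction derived from the visual
    context (e.g., an average of image-token activations); `alpha` controls
    the steering strength."""
    direction = visual_direction / visual_direction.norm()
    return hidden_state + alpha * direction


def token_logit_augmentation(final_logits, early_logits, beta=0.3):
    """Blend final-layer logits with logits decoded from an earlier layer,
    where semantically meaningful tokens tend to peak ('early excitation')."""
    return (1 - beta) * final_logits + beta * early_logits


# Toy usage with random tensors (hidden size 4, vocab size 8):
h = visual_steering(torch.randn(4), torch.randn(4), alpha=0.1)
logits = token_logit_augmentation(torch.randn(8), torch.randn(8), beta=0.3)
next_token = logits.argmax()
```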
Lay Summary: AI systems that can understand both images and text are becoming increasingly powerful, but they have a serious flaw: they often "hallucinate" by describing things that aren't actually in the images they're looking at. For example, when shown a photo of a baseball game, these systems might confidently describe objects or people that simply aren't there, making them unreliable for real-world applications. We discovered that as these AI models generate descriptions, they gradually lose track of visual information and instead rely too heavily on language patterns they learned during training. To fix this, we developed VISTA, a method that works like visual "steering" – it continuously reinforces what the AI actually sees in the image throughout the text generation process, while also using information from earlier processing layers where visual understanding is stronger. Our approach reduces hallucinations by about 40% across different AI models and tasks, without requiring any retraining of the systems. This makes vision-language AI more trustworthy and reliable for applications like medical image analysis, autonomous vehicles, and accessibility tools, helping ensure these powerful systems describe what they actually see rather than what they think they should see.
Link To Code: https://github.com/LzVv123456/VISTA
Primary Area: Deep Learning->Foundation Models
Keywords: Large Vision-Language Model, Hallucination
Submission Number: 11781