The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering

Published: 05 Feb 2025, Last Modified: 26 Feb 2025 · OpenReview Archive Direct Upload · Everyone · CC BY 4.0
Abstract: Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded content. In this paper, we investigate the internal dynamics of hallucination by examining token logit rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss – visually grounded tokens gradually become less favored throughout generation; (2) early excitation – semantically meaningful tokens achieve peak activation in layers earlier than the final layer; and (3) hidden genuine information – visually grounded tokens that are ultimately not decoded still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA combines two complementary approaches: reinforcing visual information in activation space and leveraging early-layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA reduces hallucination by about 40% on average on the evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies.
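To make the two components concrete, below is a minimal sketch (not the authors' code) of what "reinforcing visual information in activation space" and "leveraging early-layer activations" could look like at a single decoding step. All names and hyperparameters (`alpha`, `beta`, the choice of early layer, the stand-in unembedding matrix) are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of VISTA-style intervention at one decoding step.
# Assumes access to the model's per-layer hidden states and its output
# (unembedding) matrix; shapes and coefficients are placeholders.
import torch


def visual_steering_vector(hidden_visual: torch.Tensor,
                           hidden_text: torch.Tensor) -> torch.Tensor:
    """Direction in activation space pointing toward visually grounded content:
    mean activation over image tokens minus mean activation over text tokens."""
    return hidden_visual.mean(dim=0) - hidden_text.mean(dim=0)


def steer_hidden_state(h_final: torch.Tensor, v_steer: torch.Tensor,
                       alpha: float = 0.1) -> torch.Tensor:
    """Reinforce visual information by nudging the final hidden state along v_steer."""
    return h_final + alpha * v_steer


def fuse_early_logits(h_final: torch.Tensor, h_early: torch.Tensor,
                      unembed: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """Exploit 'early excitation': project an earlier layer's hidden state through
    the output head and blend its logits with the final-layer logits."""
    logits_final = h_final @ unembed.T
    logits_early = h_early @ unembed.T
    return logits_final + beta * logits_early


if __name__ == "__main__":
    d_model, vocab = 64, 100
    unembed = torch.randn(vocab, d_model)       # stand-in output embedding matrix
    hidden_visual = torch.randn(8, d_model)     # activations over image tokens
    hidden_text = torch.randn(16, d_model)      # activations over text tokens
    h_early = torch.randn(d_model)              # hidden state at an early layer
    h_final = torch.randn(d_model)              # hidden state at the last layer

    v = visual_steering_vector(hidden_visual, hidden_text)
    h_steered = steer_hidden_state(h_final, v, alpha=0.1)
    logits = fuse_early_logits(h_steered, h_early, unembed, beta=0.5)
    print("next token id:", int(logits.argmax()))
```

Because the intervention only edits activations and logits at inference time, it is training-free and can sit in front of any decoding strategy (greedy, beam search, or sampling), consistent with the abstract's claim of broad applicability.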