VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering
Abstract: Large vision–language models (LVLMs) exhibit an impressive ability to jointly reason over visual and textual inputs. However, they often produce outputs that are linguistically fluent but factually inconsistent with the visual evidence, i.e., they hallucinate. Despite growing efforts to mitigate such hallucinations, a key question remains: what form of visual attention can effectively suppress hallucinations during decoding? In this work, we provide a simple answer: the vision encoder's own attention map. We show that LVLMs tend to hallucinate when their final visual-attention maps fail to concentrate on key image objects, whereas the vision encoder's more concentrated attention maps substantially reduce hallucinations. To further investigate the cause, we analyze vision–text conflicts during decoding and find that these conflicts peak in the language model's middle layers. Injecting the vision encoder's attention maps into these layers effectively suppresses hallucinations. Building on these insights, we introduce VEGAS, a simple yet effective inference-time method that integrates the vision encoder's attention maps into the language model's middle layers and adaptively steers tokens that fail to concentrate on key image objects. Extensive experiments across multiple benchmarks demonstrate that VEGAS consistently achieves state-of-the-art performance in reducing hallucinations.
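To make the adaptive-steering idea concrete, the following is a minimal PyTorch-style sketch, not the paper's exact formulation: the function and parameter names (steer_visual_attention, entropy_threshold, alpha) and the entropy-based test for whether a token "fails to concentrate" on key image objects are illustrative assumptions.

```python
import torch


def attention_entropy(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy of attention over image tokens; high entropy = diffuse attention.

    attn: (..., num_image_tokens), rows sum to 1.
    """
    return -(attn * (attn + eps).log()).sum(dim=-1)


def steer_visual_attention(
    lm_attn: torch.Tensor,       # (batch, num_text_tokens, num_image_tokens)
    vit_attn: torch.Tensor,      # (batch, num_image_tokens), e.g. CLS-to-patch map
    entropy_threshold: float = 2.0,
    alpha: float = 0.5,
) -> torch.Tensor:
    """Blend the vision encoder's attention into the LM's mid-layer attention
    over image tokens, but only for text tokens whose LM attention is too
    diffuse (i.e., fails to concentrate on key image objects)."""
    ent = attention_entropy(lm_attn)                    # (batch, num_text_tokens)
    diffuse = (ent > entropy_threshold).unsqueeze(-1)   # mask of tokens to steer
    guided = (1 - alpha) * lm_attn + alpha * vit_attn.unsqueeze(1)
    steered = torch.where(diffuse, guided, lm_attn)
    # Renormalize so each row remains a valid attention distribution.
    return steered / steered.sum(dim=-1, keepdim=True)


if __name__ == "__main__":
    # Toy usage with random attention maps.
    lm = torch.softmax(torch.randn(1, 4, 16), dim=-1)
    vit = torch.softmax(torch.randn(1, 16), dim=-1)
    out = steer_visual_attention(lm, vit)
    print(out.shape, out.sum(dim=-1))  # (1, 4, 16), rows sum to 1
```

In an actual LVLM, a hook of this form would be applied inside the language model's middle-layer attention modules at inference time only, leaving the model weights unchanged.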