Keywords: Hallucination in Vision-Language Models, Depth- and Spatial-aware Key-Value Cache Refinement, Key-Value Cache Manipulation, Multimodal
Abstract: Large vision–language models (VLMs) deliver state-of-the-art results on a wide range of multimodal tasks, yet they remain prone to visual hallucinations, producing content that is not grounded in the input image.
Despite progress with visual supervision, reinforcement learning, and post-hoc attention reshaping, the representational origins of hallucinations remain unclear.
Our study reveals that successful grounding emerges when adjacent visual tokens exhibit coherent alignment, while hallucinations arise when key vectors scatter isotropically, weakening cross-modal attention and blurring object boundaries.
Building on this insight, we propose Depth- and Spatial-aware Cache Refinement (DSCR), a lightweight, training-free method that augments the Transformer's key-value (KV) cache with depth cues and 2D spatial proximity.
DSCR clusters vectors within objects and separates those across surfaces, guiding attention toward relevant regions without any fine-tuning.
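As a rough illustration only (this is not the paper's actual algorithm; the function name, the Gaussian-affinity form, and all parameters are assumptions), a depth- and proximity-guided refinement of cached keys could be sketched as follows: each visual token's key vector is blended toward the keys of tokens that are close both in the 2D image plane and in estimated depth, pulling same-surface keys together while leaving cross-surface keys apart.

```python
import numpy as np

def refine_keys(keys, positions, depths, sigma_xy=1.0, sigma_d=0.5, alpha=0.3):
    """Hypothetical sketch: blend each cached key toward keys of spatially
    and depth-wise close visual tokens, tightening within-object clusters.

    keys:      (N, d) cached key vectors for N visual tokens
    positions: (N, 2) token (x, y) grid coordinates
    depths:    (N,)   estimated depth per token
    alpha:     blend strength (0 = no refinement)
    """
    # Pairwise squared distances in the image plane and in depth
    dxy = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    dd = (depths[:, None] - depths[None, :]) ** 2

    # Gaussian affinity: high only when tokens are near in 2D AND on a
    # similar surface (similar depth)
    w = np.exp(-dxy / (2 * sigma_xy**2) - dd / (2 * sigma_d**2))
    w /= w.sum(axis=1, keepdims=True)  # row-normalize into weights

    # Convex blend of each key with its affinity-weighted neighborhood
    return (1 - alpha) * keys + alpha * (w @ keys)
```

With this form, keys of adjacent same-depth tokens move toward a shared neighborhood average (more coherent alignment), while tokens separated by a depth discontinuity exchange almost no mass and stay apart.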
Comprehensive evaluations show that DSCR consistently reduces hallucinations, delivering up to 41.6% accuracy gains across MME, POPE, RePOPE, CHAIR, and a new depth-sensitive benchmark.
Our findings highlight KV-coherence as a core factor behind hallucinations and demonstrate a practical, model-agnostic solution for enhancing VLM reliability.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7218