Mitigating Visual Hallucinations for Reliable Multimodal Agents
Keywords: Hallucinations
Abstract: Large Vision-Language Models (LVLMs) are increasingly used as perception modules in multimodal agents, yet object hallucination can propagate to false tool use and unsafe downstream decisions. We identify a consistent three-phase attention structure in LVLM vision encoders (diffusion, focus, and rediffusion) and show that hallucination is most sensitive to low-attention tokens in the focus phase. We propose a lightweight training-free intervention that suppresses such tokens using single-pass statistics and DPP-based selection. Across multiple LVLM backbones, our method reduces hallucination with negligible overhead and improves reliability in an object-triggered tool-calling setup.
Track: Short Paper (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 51