VisionFocus: Towards Efficient Hallucination Mitigation via Token-Aware Visual Enhancement

16 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Large Language Models, Hallucination
Abstract: Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) are prone to hallucinations. Recent efforts to address this issue have primarily focused on suppressing the inherent language priors of Large Language Models (LLMs) through contrastive decoding or uniformly enhancing attention to all visual tokens via attention intervention. However, these approaches either incur significant inference latency or exacerbate hallucinations in certain cases. In this work, we identify a critical insight: *Not all visual tokens are beneficial for hallucination mitigation*. Specifically, we observe that the vision encoder in MLLMs gradually focuses its attention on a limited subset of visual tokens. Further experiments demonstrate that tokens receiving high attention are crucial for mitigating hallucinations, whereas indiscriminate enhancement of low-attention tokens may exacerbate them. Based on these findings, we propose **VisionFocus**, a training-free, efficient, and plug-and-play method to mitigate hallucinations. It guides the model to concentrate on informative visual tokens during decoding, while avoiding excessive amplification of irrelevant or distracting visual information. This selective enhancement strengthens visual grounding and effectively mitigates hallucinations. Extensive experiments on six widely used benchmarks demonstrate the effectiveness of VisionFocus in mitigating hallucinations across various MLLM families without requiring additional training. In addition, VisionFocus achieves state-of-the-art performance in hallucination mitigation while maintaining competitive decoding speed, highlighting its practical utility. The code and models will be made publicly available soon.
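
To make the abstract's idea of selective visual-token enhancement concrete, here is a minimal sketch of the general technique it describes: rank visual tokens by the attention they receive in the vision encoder, then boost the decoder's attention only toward the top-ranked tokens. This is an illustration of the stated idea, not the authors' implementation; the function names (`select_informative_tokens`, `reweight_attention_logits`), the `keep_ratio` and `alpha` parameters, and the multiplicative boost are all assumptions for the sake of the example.

```python
import torch


def select_informative_tokens(vision_attn: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Pick the visual tokens that receive the most attention in the vision encoder.

    vision_attn: (num_heads, num_visual_tokens) attention paid to each visual token
                 in a late vision-encoder layer (illustrative input format).
    Returns a boolean mask of shape (num_visual_tokens,) marking high-attention tokens.
    """
    scores = vision_attn.mean(dim=0)                       # average attention over heads
    k = max(1, int(keep_ratio * scores.numel()))           # keep a small fraction of tokens
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[scores.topk(k).indices] = True
    return mask


def reweight_attention_logits(
    attn_logits: torch.Tensor,
    visual_positions: torch.Tensor,
    informative_mask: torch.Tensor,
    alpha: float = 1.5,
) -> torch.Tensor:
    """Boost the LLM decoder's pre-softmax attention only for informative visual tokens.

    attn_logits: (..., seq_len) decoder attention scores over the full input sequence.
    visual_positions: (num_visual_tokens,) indices of visual tokens in that sequence.
    informative_mask: boolean mask from select_informative_tokens.
    """
    boosted = attn_logits.clone()
    selected = visual_positions[informative_mask]          # only high-attention visual tokens
    boosted[..., selected] = boosted[..., selected] * alpha  # selective enhancement
    return boosted
```

In use, such a hook would be applied to the decoder's attention scores at each generation step, leaving low-attention visual tokens untouched so that irrelevant or distracting visual information is not amplified.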
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 6765