ARTS: Alleviating Hallucinations in Large Vision–Language Models via Redundancy-Aware Token Selection
Keywords: Vision-Language Models (VLMs), Hallucination Mitigation, Decoding Method, Redundancy Reduction
TL;DR: We identify a new source of hallucinations in Vision-Language Models (VLMs) and propose a novel, training-free decoding method to mitigate them.
Abstract: Large Vision–Language Models (LVLMs) demonstrate significant potential in multimodal tasks, yet they are prone to hallucinations, where generated outputs deviate from the visual evidence. A mainstream approach to mitigating hallucinations in LVLMs is to develop training-free decoding strategies. Most of these methods posit that hallucinations stem from insufficient attention to relevant information and therefore focus on strengthening the model's utilization of informative content. Beyond this perspective, we reveal a new source of hallucination: visual tokens in intermediate decoder layers often become redundant or noisy, thereby misleading multimodal reasoning. We then evaluate commonly used token-importance metrics and observe that they cannot effectively identify redundant visual tokens in this setting. To address this problem, we introduce ARTS, a decoding strategy that first reintegrates the original visual embeddings to enrich essential visual information, and then employs a novel sink-token-based method to select important visual tokens in intermediate decoder layers. Extensive experiments across multiple benchmarks and LVLM architectures demonstrate that our approach consistently reduces hallucinations and improves factual alignment.
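To make the two steps described in the abstract concrete, below is a minimal, illustrative sketch of how a sink-token-based selection of visual tokens at an intermediate decoder layer might look, combined with reintegration of the original visual embeddings. This is not the authors' released implementation; the tensor shapes, the function name `select_visual_tokens`, the blending weight `alpha`, and the choice to zero out unselected tokens are all hypothetical assumptions for illustration.

```python
# Illustrative sketch only (assumed shapes and hyperparameters, not the paper's code):
# given the attention weights of one intermediate decoder layer, score each visual
# token by how strongly the attention-sink token attends to it, keep the top-k
# visual tokens, and blend the surviving hidden states with the original visual
# embeddings from the vision encoder.
import torch


def select_visual_tokens(attn, hidden, orig_visual_emb,
                         visual_start, visual_end,
                         sink_idx=0, keep_ratio=0.5, alpha=0.3):
    """
    attn:            (num_heads, seq_len, seq_len) attention weights of one layer
    hidden:          (seq_len, dim) hidden states entering the next layer
    orig_visual_emb: (num_visual, dim) original visual embeddings
    visual_start/visual_end: slice [visual_start, visual_end) holding visual tokens
    """
    num_visual = visual_end - visual_start

    # Score each visual token by the sink token's attention to it, averaged over heads.
    sink_to_visual = attn[:, sink_idx, visual_start:visual_end].mean(dim=0)  # (num_visual,)

    # Keep the top-k visual tokens; treat the rest as redundant.
    k = max(1, int(keep_ratio * num_visual))
    keep = torch.zeros(num_visual, dtype=torch.bool)
    keep[sink_to_visual.topk(k).indices] = True

    # Reintegrate the original visual embeddings into the kept positions
    # and suppress the redundant ones (zeroing is an assumption here).
    new_hidden = hidden.clone()
    vis = hidden[visual_start:visual_end]
    vis = torch.where(
        keep.unsqueeze(-1),
        (1 - alpha) * vis + alpha * orig_visual_emb,
        torch.zeros_like(vis),
    )
    new_hidden[visual_start:visual_end] = vis
    return new_hidden, keep
```

In a decoding loop, such a hook would be applied at one or more intermediate layers before the hidden states are passed onward; the actual layer choice, scoring rule, and reintegration scheme used by ARTS are specified in the paper rather than in this sketch.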
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24075