ARTS: Alleviating Hallucinations in Large Vision–Language Models via Redundancy-Aware Token Selection
Keywords: Vision-Language Models (VLMs), Hallucination Mitigation, Decoding Method, Redundancy Reduction
TL;DR: We identify a new source of hallucinations in Vision-Language Models (VLMs) and propose a novel, training-free decoding method to mitigate them.
Abstract: Large Vision–Language Models (LVLMs) demonstrate significant potential in multimodal tasks, yet they are prone to hallucinations, where generated outputs deviate from the visual evidence. A mainstream approach to mitigating hallucinations in LVLMs is to develop training-free decoding strategies. Most of these methods posit that hallucinations stem from insufficient attention to relevant information and therefore focus on strengthening the model's utilization of informative content. Beyond this perspective, we reveal a new source of hallucination: visual tokens in intermediate decoder layers often become redundant or noisy, thereby misleading multimodal reasoning. We then evaluate commonly used token-importance metrics and observe that they cannot effectively identify redundant visual tokens in this setting. To address this problem, we introduce ARTS, a decoding strategy that first reintegrates the original visual embeddings to enrich essential visual information, and then employs a novel sink-token-based method to select important visual tokens in intermediate decoder layers. Extensive experiments across multiple benchmarks and LVLM architectures demonstrate that our approach consistently reduces hallucinations and improves factual alignment.
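To make the two steps described in the abstract concrete, below is a minimal, illustrative sketch of how a sink-token-based selection of visual tokens at an intermediate decoder layer might look, combined with reintegration of the original visual embeddings. This is not the authors' released implementation; the tensor shapes, the function name `select_visual_tokens`, the blending weight `alpha`, and the choice to zero out unselected tokens are all hypothetical assumptions for illustration.

```python
# Illustrative sketch only (assumed shapes and hyperparameters, not the paper's code):
# given the attention weights of one intermediate decoder layer, score each visual
# token by how strongly the attention-sink token attends to it, keep the top-k
# visual tokens, and blend the surviving hidden states with the original visual
# embeddings from the vision encoder.
import torch


def select_visual_tokens(attn, hidden, orig_visual_emb,
                         visual_start, visual_end,
                         sink_idx=0, keep_ratio=0.5, alpha=0.3):
    """
    attn:            (num_heads, seq_len, seq_len) attention weights of one layer
    hidden:          (seq_len, dim) hidden states entering the next layer
    orig_visual_emb: (num_visual, dim) original visual embeddings
    visual_start/visual_end: slice [visual_start, visual_end) holding visual tokens
    """
    num_visual = visual_end - visual_start

    # Score each visual token by the sink token's attention to it, averaged over heads.
    sink_to_visual = attn[:, sink_idx, visual_start:visual_end].mean(dim=0)  # (num_visual,)

    # Keep the top-k visual tokens; treat the rest as redundant.
    k = max(1, int(keep_ratio * num_visual))
    keep = torch.zeros(num_visual, dtype=torch.bool)
    keep[sink_to_visual.topk(k).indices] = True

    # Reintegrate the original visual embeddings into the kept positions
    # and suppress the redundant ones (zeroing is an assumption here).
    new_hidden = hidden.clone()
    vis = hidden[visual_start:visual_end]
    vis = torch.where(
        keep.unsqueeze(-1),
        (1 - alpha) * vis + alpha * orig_visual_emb,
        torch.zeros_like(vis),
    )
    new_hidden[visual_start:visual_end] = vis
    return new_hidden, keep
```

In a decoding loop, such a hook would be applied at one or more intermediate layers before the hidden states are passed onward; the actual layer choice, scoring rule, and reintegration scheme used by ARTS are specified in the paper rather than in this sketch.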
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24075