PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning
Keywords: Multi-modal, token pruning, hallucination
TL;DR: A training-free, adaptive pipeline for mitigating hallucinations in MLLMs
Abstract: While multi-modal large language models (MLLMs) have made significant progress in recent years, hallucination remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that the occurrence of hallucinations in MLLMs is closely associated with low attention distribution on visual tokens. Moreover, redundant visual tokens divert part of the model's attention, further causing important visual tokens to be neglected. Building on this observation, we propose \textbf{PruneHal}, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model’s focus on critical visual information, thereby mitigating hallucinations. To the best of our knowledge, we are the first to apply token pruning for hallucination mitigation in MLLMs. Notably, our method requires no additional training and incurs no extra inference cost, thereby introducing no computational overhead. Moreover, PruneHal is model-agnostic and can be seamlessly integrated with different decoding strategies, including those specifically designed for hallucination mitigation. We evaluate PruneHal on several widely used MLLMs and benchmarks, achieving robust and strong experimental results that highlight the potential of our method. Our code will be publicly available.
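To make the core idea concrete, the following is a minimal sketch of attention-guided pruning of visual tokens from a KV cache; it is an illustration of the general technique, not the authors' implementation, and the function name, the `visual_slice`/`keep_ratio` parameters, and the head/batch averaging scheme are assumptions.

```python
# Minimal PyTorch sketch: drop low-attention visual tokens from a cached K/V pair.
# Not PruneHal's actual algorithm; all names and the scoring scheme are illustrative.
import torch

def prune_visual_kv(keys, values, attn_to_visual, visual_slice, keep_ratio=0.5):
    """Keep only the most-attended visual tokens in the KV cache.

    keys, values:    [batch, heads, seq_len, head_dim] cached key/value tensors
    attn_to_visual:  [batch, heads, num_visual] attention that text tokens paid
                     to each cached visual token
    visual_slice:    slice of sequence positions holding visual tokens
    keep_ratio:      fraction of visual tokens to retain (assumed hyperparameter)
    """
    # Score each visual token by its attention, averaged over batch and heads.
    scores = attn_to_visual.mean(dim=(0, 1))                 # [num_visual]
    num_keep = max(1, int(keep_ratio * scores.numel()))
    keep_local = torch.topk(scores, num_keep).indices        # indices within the visual span
    keep_global = keep_local + visual_slice.start            # map to absolute cache positions

    # Non-visual positions (system prompt, text) are always retained.
    all_pos = torch.arange(keys.size(2), device=keys.device)
    non_visual = (all_pos < visual_slice.start) | (all_pos >= visual_slice.stop)
    kept = torch.cat([all_pos[non_visual], keep_global]).sort().values

    return keys[:, :, kept, :], values[:, :, kept, :]
```

Because the pruning only shrinks the cache after prefill, a sketch like this adds no extra forward passes, which is consistent with the training-free, no-overhead setting described in the abstract.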
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10313