Keywords: Token Compression; Robustness
TL;DR: We show for the first time that visual token pruning enhances the robustness of VLMs, mitigating vulnerabilities such as jailbreak attacks and hallucinations.
Abstract: In this paper, we show for the first time that visual token pruning enhances the robustness of Vision-Language Models (VLMs), mitigating vulnerabilities such as jailbreak attacks and hallucinations. Since the vision and language modalities cannot be perfectly aligned, misaligned visual tokens may act as out-of-distribution (OOD) inputs, leading to unpredictable outputs and introducing potential vulnerabilities. Building on this insight, we aim to enhance model robustness against jailbreaks and hallucinations by selectively reducing visual tokens, with reduced inference cost as a side benefit. Specifically, we measure the distance between each visual token and the language feature space; visual tokens with large distances are identified as OOD tokens and iteratively pruned. To demonstrate the effectiveness of our method, we evaluate it on seven diverse, popular benchmarks. Notably, our method yields an average improvement of 13.46% in defending against jailbreak attacks, consistently achieves competitive performance in mitigating hallucinations, and maintains strong results on general benchmarks such as MME.
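To make the pruning step concrete, below is a minimal sketch of what iterative OOD visual token pruning could look like. It assumes the distance to the language feature space is approximated by each visual token's cosine distance to its nearest entry in the LLM's text embedding table; the function name `prune_ood_visual_tokens`, the fixed per-round prune ratio, and the toy dimensions are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def prune_ood_visual_tokens(visual_tokens, language_embeddings,
                            prune_ratio=0.1, num_iters=3):
    """Iteratively prune visual tokens far from the language feature space.

    visual_tokens:       (N, d) visual token features, already projected
                         into the LLM embedding space.
    language_embeddings: (V, d) text embedding matrix, used here as a
                         proxy for the language feature space.
    prune_ratio:         fraction of remaining tokens dropped per round.
    num_iters:           number of pruning rounds.
    Returns the surviving visual tokens, shape (M, d).
    """
    tokens = visual_tokens
    for _ in range(num_iters):
        # Distance of each visual token to the language feature space,
        # approximated by its nearest language embedding under cosine
        # similarity (an assumption of this sketch).
        v = torch.nn.functional.normalize(tokens, dim=-1)
        t = torch.nn.functional.normalize(language_embeddings, dim=-1)
        sim = v @ t.T                        # (N, V) cosine similarities
        dist = 1.0 - sim.max(dim=-1).values  # distance to nearest text token

        # Tokens with the largest distances are treated as OOD and dropped.
        num_keep = max(1, int(tokens.size(0) * (1.0 - prune_ratio)))
        keep_idx = dist.argsort()[:num_keep]
        tokens = tokens[keep_idx.sort().values]  # keep original token order
    return tokens

# Toy usage: 576 visual tokens against a 32k-entry text embedding table.
visual = torch.randn(576, 4096)
lang = torch.randn(32000, 4096)
kept = prune_ood_visual_tokens(visual, lang, prune_ratio=0.2, num_iters=2)
print(kept.shape)  # torch.Size([368, 4096])
```

Re-sorting the surviving indices before gathering preserves the original (spatial) ordering of the visual tokens, so downstream positional information is unaffected by the pruning.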
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5130