Keywords: adversarial attack, vision-encoder-only, large vision language models, downstream-agnostic
TL;DR: We propose a vision‑encoder‑only attack on LVLMs that achieves efficient and transferable untargeted attacks under limited perturbation sizes.
Abstract: Large Vision-Language Models (LVLMs) have demonstrated strong capabilities in multimodal understanding, yet their vulnerability to adversarial attacks raises significant concerns. Toward practical attacks, this paper aims at efficient and transferable untargeted attacks under limited perturbation sizes. Under this objective, white‑box attacks require full‑model gradients and task‑specific labels, so their cost scales with the number of tasks, while black‑box attacks rely on proxy models and typically require large perturbation sizes and elaborate transfer strategies. Given the centrality and widespread reuse of the vision encoder in LVLMs, we adopt a gray‑box setting that targets the vision encoder alone for efficient yet effective attacks. We theoretically establish the feasibility of vision‑encoder‑only attacks, laying the foundation for our gray‑box setting. Based on this analysis, we propose perturbing patch tokens rather than the class token, informed by both theoretical and empirical insights. We generate adversarial examples by minimizing the cosine similarity between clean and perturbed visual features, without accessing the subsequent models, tasks, or labels. This significantly reduces computational overhead while eliminating dependence on tasks and labels. The resulting attack, VEAttack, achieves a performance degradation of 94.5% on the image captioning task and 75.7% on the visual question answering task. We also reveal several key observations that provide insights into LVLM attack/defense: 1) hidden-layer variations of the LLM, 2) token attention differentials, 3) a Möbius band in transfer attacks, 4) low sensitivity to the number of attack steps.
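For concreteness, below is a minimal sketch of the attack idea described in the abstract: a PGD-style optimization that minimizes the cosine similarity between clean and perturbed patch-token features of a frozen vision encoder under an L∞ budget. This is not the authors' released implementation; the encoder interface, the function name `ve_attack_sketch`, and the hyperparameters `eps`, `alpha`, and `steps` are illustrative assumptions.

```python
# Sketch of a vision-encoder-only untargeted attack (assumptions noted above).
import torch
import torch.nn.functional as F

def ve_attack_sketch(vision_encoder, images, eps=8/255, alpha=1/255, steps=20):
    """Assumes `vision_encoder(images)` returns patch-token features (B, N, D)."""
    vision_encoder.eval()
    with torch.no_grad():
        clean_feats = vision_encoder(images)  # clean patch-token features

    # Random start inside the L-infinity ball of radius eps.
    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)

    for _ in range(steps):
        adv_feats = vision_encoder((images + delta).clamp(0, 1))
        # Minimize cosine similarity between clean and perturbed patch tokens,
        # i.e., maximize (1 - cos) averaged over all patch tokens.
        loss = 1 - F.cosine_similarity(adv_feats, clean_feats, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient ascent on the loss
            delta.clamp_(-eps, eps)             # project back into the budget
            delta.grad = None

    return (images + delta).detach().clamp(0, 1)
```

Note that this loss uses only the vision encoder's features, so no downstream LLM, task, or label is touched, matching the downstream-agnostic, gray-box setting described above.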
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 363