Towards Adversarially Robust VLMs with an Information-Theoretic Approach

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Alignment, Safety, Representation Learning, Robustness, Vision-Language Model
TL;DR: We propose an information-theoretic approach to robustifying VLMs against adversarial attacks.
Abstract: Vision–Language Models (VLMs) derive their zero-shot ability from tight alignment between image and text representations, which can be viewed through the lens of mutual information (MI). This alignment is fragile: VLMs are vulnerable both to subtle pixel-level adversarial attacks and to typographic attacks in which overlaid text hijacks predictions. Existing defenses are isolated solutions, relying on proxy objectives tailored to each threat. We argue that both attack types share a single failure mechanism, a reduction in cross-modal MI induced by the perturbation, and we propose an information-theoretic framework that directly addresses this root cause of multimodal adversarial attacks. We first prove a bound that links adversarial risk to the MI gap, defined as the reduction in MI between clean and perturbed image–text views. Building on this, we derive a practical, differentiable objective that minimizes an upper bound on the MI gap using a neural MI estimator, yielding a single, attack-agnostic training scheme. Empirically, our method simultaneously improves robustness to both pixel-space and typographic attacks, surpassing specialized state-of-the-art defenses while maintaining high accuracy on clean inputs. These results show that explicitly preserving cross-modal MI is a principled and effective path to robust VLMs.
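
To make the attack-agnostic training scheme concrete, the sketch below shows one way an MI-gap penalty could be implemented with an InfoNCE-style lower bound as the neural MI estimator; it is an illustrative assumption rather than the authors' released code, and the function names, temperature `tau`, and weight `lambda_gap` are hypothetical.

```python
# Minimal sketch of an MI-gap penalty, assuming an InfoNCE-style neural MI
# lower bound. Not the authors' implementation; names and defaults are illustrative.
import math
import torch
import torch.nn.functional as F


def info_nce_mi(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE lower bound on MI between paired, L2-normalized (B, D) embeddings."""
    logits = img_emb @ txt_emb.t() / tau                       # (B, B) cross-modal similarities
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric contrastive loss: match images to texts and texts to images.
    nce = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    # Standard bound: I(X; Y) >= log(B) - InfoNCE loss.
    return math.log(img_emb.size(0)) - nce


def mi_gap_penalty(img_clean: torch.Tensor, img_adv: torch.Tensor, txt: torch.Tensor,
                   tau: float = 0.07) -> torch.Tensor:
    """Estimated drop in cross-modal MI caused by the perturbation, clamped at zero."""
    gap = info_nce_mi(img_clean, txt, tau) - info_nce_mi(img_adv, txt, tau)
    return torch.clamp(gap, min=0.0)


# Hypothetical training step: add the penalty to the usual task loss.
# total_loss = task_loss + lambda_gap * mi_gap_penalty(f(x_clean), f(x_adv), g(captions))
```

In this sketch the gap is clamped at zero so the penalty only activates when the perturbed view carries less information about the paired text than the clean view; the same penalty form would apply whether the perturbation is a pixel-space or a typographic attack.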
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24177