Keywords: Adversarial Attacks, Multimodal Large Language Models, Distributional Alignment, Energy Distance, Graph Neural Networks
Abstract: This paper studies targeted adversarial attacks against closed-source MLLMs, which aim to generate highly transferable adversarial samples using open-source MLLMs. Previous approaches typically maximize the similarity of latent representations between adversarial samples and target samples. However, these approaches can overfit to specific target samples, which severely limits their generalization to closed-source MLLMs. To address this, we propose a novel approach named Relational Distribution-aware Intrinsic Alignment (RISE) for adversarial attacks against closed-source MLLMs. The core of RISE is a statistical lens that characterizes the intrinsic semantics of images for more generalized and robust alignment. In particular, each augmented image is treated as a sample from the intrinsic distribution of the original image. We then use the non-parametric energy distance to measure the divergence between distributions, which lends itself naturally to semantic alignment in the hidden space. To further improve transferability to specific target models, we learn a Graph Neural Network (GNN) that explores the complex transferability relations between source and target MLLMs and adaptively selects surrogate source models for different target MLLMs. Extensive experiments on benchmark datasets validate the effectiveness of the proposed RISE in comparison to competing baselines.
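As a rough illustration of the distributional view described in the abstract, the sketch below computes the standard squared energy distance, 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||, between two sets of embedding vectors. It is not the paper's implementation: the function names, the 4-dimensional Gaussian point clouds standing in for embeddings of augmented views, and the sample sizes are all assumptions made for the example.

```python
import math
import random


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y), x in xs and y in ys."""
    total = sum(math.dist(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def energy_distance_sq(xs, ys):
    """Squared energy distance between two samples (V-statistic estimate)."""
    return (
        2.0 * mean_pairwise_distance(xs, ys)
        - mean_pairwise_distance(xs, xs)
        - mean_pairwise_distance(ys, ys)
    )


def sample_cloud(center, n, scale, rng):
    """Hypothetical stand-in for embeddings of n augmented views of one image."""
    return [[c + rng.gauss(0.0, scale) for c in center] for _ in range(n)]


rng = random.Random(0)
# Assumed toy data: views of an adversarial image and of a target image.
adv_views = sample_cloud([0.0, 0.0, 0.0, 0.0], 32, 0.1, rng)
target_views = sample_cloud([1.0, 1.0, 1.0, 1.0], 32, 0.1, rng)
# A second draw around the target, mimicking a well-aligned adversarial image.
aligned_views = sample_cloud([1.0, 1.0, 1.0, 1.0], 32, 0.1, rng)

far = energy_distance_sq(adv_views, target_views)
near = energy_distance_sq(aligned_views, target_views)
print(f"misaligned: {far:.3f}, aligned: {near:.3f}")
```

In an attack setting, a quantity of this kind would be minimized with respect to the adversarial perturbation, so that the whole cloud of augmented adversarial embeddings moves toward the target cloud rather than toward a single target point.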
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12137