A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models
Abstract: With Vision-Language Pre-training (VLP) models demonstrating powerful multimodal interaction capabilities, the application scenarios of neural networks are no longer confined to unimodal domains such as CV and NLP, but have expanded to more complex multimodal V+L downstream tasks. The security vulnerabilities of unimodal models have been extensively examined, whereas those of VLP models remain underexplored. We note that in CV models, the understanding of images comes from annotated information, whereas VLP models are designed to learn image representations directly from raw text. Motivated by this discrepancy, we develop the Feature Guidance Attack (FGA), a novel method that uses text representations to direct the perturbation of clean images, resulting in the generation of adversarial images. FGA is orthogonal to many advanced attack strategies in the unimodal domain, facilitating the direct application of rich research findings from the unimodal to the multimodal scenario. By appropriately introducing text attack into FGA, we construct Feature Guidance with Text Attack (FGA-T). Through the interaction of attacks on the two modalities, FGA-T achieves superior attack performance against VLP models. Moreover, incorporating data augmentation and momentum mechanisms significantly improves the black-box transferability of FGA-T. Our method demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and both black-box and white-box settings, offering a unified baseline for exploring the robustness of VLP models.
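To make the idea of text-guided image perturbation concrete, the following is a minimal sketch of an FGA-style attack loop, not the authors' implementation: the `image_encoder` / `text_encoder` interfaces, the cosine-similarity objective, and the L-inf PGD schedule (`eps`, `alpha`, `steps`) are assumptions standing in for any CLIP-like VLP model that exposes differentiable, L2-normalizable features.

```python
# Hypothetical FGA-style sketch: perturb an image (in [0, 1]) so that its
# feature moves away from the paired text feature, under an L-inf budget.
import torch
import torch.nn.functional as F

def fga_attack(image_encoder, text_encoder, images, text_tokens,
               eps=8/255, alpha=2/255, steps=10):
    """PGD-style loop guided by text representations (assumed objective:
    minimize cosine similarity between adversarial image and text features)."""
    with torch.no_grad():
        txt_feat = F.normalize(text_encoder(text_tokens), dim=-1)

    adv = images.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)  # random start

    for _ in range(steps):
        adv.requires_grad_(True)
        img_feat = F.normalize(image_encoder(adv), dim=-1)
        # Cosine similarity between image and guiding text representations.
        sim = (img_feat * txt_feat).sum(dim=-1).mean()
        grad = torch.autograd.grad(sim, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                    # push features apart
            adv = images + (adv - images).clamp(-eps, eps)     # project to L-inf ball
            adv = adv.clamp(0, 1).detach()
    return adv
```

Because the loop only rewrites the update rule around a feature-space loss, unimodal tricks such as momentum accumulation on `grad` or input augmentations before encoding can be dropped in directly, which is the orthogonality the abstract refers to.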
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation
Relevance To Conference: This work introduces a novel adversarial attack method, named Feature Guidance Attack (FGA), which leverages text representations to manipulate image inputs, aiming to deceive Vision-Language Pre-training (VLP) models. It bridges the gap between unimodal domains (such as Computer Vision and Natural Language Processing) and the multimodal domain, applying insights from unimodal adversarial strategies to the more complex realm of V+L (Vision+Language) tasks. The extension of FGA into FGA-T, by incorporating textual attacks, further enhances its efficacy by exploiting vulnerabilities across both the visual and linguistic modalities. This dual-modality approach yields more potent adversarial examples and directly informs the robustness and security analysis of VLP models. By achieving superior attack performance across various VLP models and V+L tasks, this research not only highlights potential security vulnerabilities in multimodal systems but also sets a new benchmark for evaluating and improving the resilience of multimedia/multimodal processing technologies against adversarial attacks. The inclusion of data augmentation and momentum mechanisms to improve black-box transferability further underscores the potential of this approach to advance the field by offering a new perspective on crafting more sophisticated adversarial examples in multimodal contexts.
Supplementary Material: zip
Submission Number: 2924