Boosting Adversarial Robustness of Vision-Language Pre-training Models against Multimodal Adversarial Attacks

Published: 05 Mar 2025 · Last Modified: 14 Apr 2025 · Venue: BuildingTrust · License: CC BY 4.0
Track: Long Paper Track (up to 9 pages)
Keywords: vision-language pretraining models, adversarial fine-tuning
TL;DR: Our approach uses multi-granularity aligned visual adversarial fine-tuning to enhance the robustness of vision-language pretraining models against multimodal adversarial attacks.
Abstract: Vision-language pre-training (VLP) models, known for their generalization across multimodal tasks, are increasingly deployed in perturbation-sensitive environments, highlighting the need for improved adversarial robustness. Recent studies have revealed VLP models' vulnerability to multimodal adversarial attacks, which exploit interactions across multiple modalities to uncover deeper weaknesses than single-modal attacks. Methods like Co-attack, SGA, and VLP-attack leverage cross-modal interactions to more effectively challenge models' robustness. To counter these threats, adversarial fine-tuning has emerged as a key strategy. Our approach refines vision encoders using Multi-granularity Aligned Visual Adversarial Fine-tuning, which enhances robustness by expanding the vision semantic space and aligning features across perturbed and clean models. Extensive experiments demonstrate that our method offers superior robustness to multimodal adversarial attacks while preserving clean performance on downstream V+L tasks.
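Illustration: The abstract describes adversarial fine-tuning of the vision encoder with feature alignment between perturbed and clean models. The sketch below is a minimal, hypothetical PyTorch rendering of that general idea only, not the paper's actual Multi-granularity Aligned Visual Adversarial Fine-tuning objective; `ToyVisionEncoder`, `pgd_perturb`, `finetune_step`, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch: adversarial fine-tuning of a vision encoder with
# clean/adversarial feature alignment against a frozen clean copy.
import copy
import torch
import torch.nn.functional as F
from torch import nn


class ToyVisionEncoder(nn.Module):
    """Stand-in for a VLP vision encoder (e.g., a ViT); outputs a normalized embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return F.normalize(self.backbone(x), dim=-1)


def pgd_perturb(encoder, images, eps=8 / 255, alpha=2 / 255, steps=3):
    """Craft image perturbations that push features away from the clean embedding."""
    clean_feat = encoder(images).detach()
    adv = images.clone().detach()
    adv = adv + torch.empty_like(adv).uniform_(-eps, eps)
    for _ in range(steps):
        adv.requires_grad_(True)
        # Maximize feature deviation from the clean embedding.
        loss = -F.cosine_similarity(encoder(adv), clean_feat, dim=-1).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = images + (adv - images).clamp(-eps, eps)
        adv = adv.clamp(0, 1)
    return adv.detach()


def finetune_step(encoder, frozen_clean_encoder, images, optimizer, lam=1.0):
    """One fine-tuning step: align adversarial features of the trainable encoder
    with clean features from a frozen copy, while preserving clean features."""
    adv_images = pgd_perturb(encoder, images)
    with torch.no_grad():
        anchor = frozen_clean_encoder(images)  # clean reference features
    adv_feat = encoder(adv_images)
    clean_feat = encoder(images)
    align_loss = (1 - F.cosine_similarity(adv_feat, anchor, dim=-1)).mean()
    preserve_loss = (1 - F.cosine_similarity(clean_feat, anchor, dim=-1)).mean()
    loss = align_loss + lam * preserve_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    encoder = ToyVisionEncoder()
    frozen = copy.deepcopy(encoder).eval()
    for p in frozen.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
    images = torch.rand(4, 3, 64, 64)  # dummy image batch in [0, 1]
    print("loss:", finetune_step(encoder, frozen, images, opt))
```

The frozen clean encoder serves only as an alignment anchor so that fine-tuning on perturbed inputs does not drift the feature space and degrade clean downstream performance; the paper's multi-granularity alignment and semantic-space expansion are not reproduced here.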
Submission Number: 57