Individual and Common Attack: Enhancing Transferability in VLP Models Through Modal Feature Exploitation

Yaguan Qian, Yaxin Kong, Qiqi Bao, Zhaoquan Gu, Bin Wang, Shouling Ji, Jianping Zhang, Zhen Lei

Published: 2026, Last Modified: 12 Mar 2026 | IEEE Trans. Image Process. 2026 | CC BY-SA 4.0

Abstract: Vision–Language Pretrained (VLP) models exhibit strong multimodal understanding and reasoning capabilities and are widely applied in tasks such as image–text retrieval and visual grounding. However, they remain highly vulnerable to adversarial attacks, which raises serious reliability concerns in safety-critical scenarios. We observe that existing methods for optimizing adversarial examples typically rely on individual features from the other modality as guidance, causing the crafted adversarial examples to overfit that modality’s learning preferences and thus limiting their transferability. To further enhance the transferability of adversarial examples, we propose a novel adversarial attack framework, I&CA (Individual & Common feature Attack), which simultaneously considers the individual features within each modality and the common features that arise in cross-modal interactions. Concretely, I&CA first drives divergence among individual features within each modality to disrupt single-modality learning, and then suppresses the expression of common features during cross-modal interactions, thereby undermining the robustness of the fusion mechanism. In addition, to prevent adversarial perturbations from overfitting to the learning bias of the other modality, which may distort the representation of common features, we apply augmentation strategies to both modalities simultaneously. Across various experimental settings and widely recognized multimodal benchmarks, the I&CA framework achieves an average transferability improvement of 6.15% over the state-of-the-art DRA method, delivering significant performance gains in both cross-model and cross-task attack scenarios.
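
The abstract does not state the loss functions or the optimizer used by I&CA, so the following is only a toy sketch of the general idea on the image side, not the authors' method. It assumes linear maps (`W_img`, `W_txt`) as stand-ins for the image and text encoders, cosine similarity as the feature-agreement measure, Gaussian input noise as a placeholder augmentation, and a signed-gradient update under an L-infinity budget. All names and parameter values here (`toy_attack`, `eps`, `step`, `iters`) are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_grad(u, v):
    """Gradient of cosine(u, v) with respect to u."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return v / (nu * nv) - (u @ v) * u / (nu ** 3 * nv)


def toy_attack(x, t, W_img, W_txt, eps=0.1, step=0.02, iters=20, seed=0):
    """Perturb image vector x within an L-infinity ball of radius eps.

    Assumed objective (illustrative only): minimize
      cosine(adv image feature, clean image feature)   # individual term
    + cosine(adv image feature, paired text feature)   # common term
    """
    rng = np.random.default_rng(seed)
    clean_feat = W_img @ x   # clean image feature (toy linear encoder)
    text_feat = W_txt @ t    # paired text feature (toy linear encoder)

    # Random start: at delta = 0 the gradient of the individual term vanishes.
    delta = rng.uniform(-eps, eps, size=x.shape)

    for _ in range(iters):
        # Placeholder augmentation: small Gaussian noise on the input.
        x_aug = x + delta + rng.normal(0.0, 0.01, size=x.shape)
        adv_feat = W_img @ x_aug

        # Gradient of both similarity terms w.r.t. the adversarial feature.
        g_feat = cosine_grad(adv_feat, clean_feat) + cosine_grad(adv_feat, text_feat)
        # Chain rule through the linear encoder to get the input gradient.
        g_x = W_img.T @ g_feat

        # Signed descent step, then project back into the eps-ball.
        delta = np.clip(delta - step * np.sign(g_x), -eps, eps)

    return x + delta
```

In this sketch the first similarity term plays the role of pushing the adversarial image feature away from its clean counterpart, and the second plays the role of weakening image-text agreement. A real attack would replace the linear maps with the surrogate VLP model's encoders, use the paper's actual objectives and augmentations, and perturb the text modality as well.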

External IDs: dblp:journals/tip/QianKBGWJZL26