Enhancing the Robustness of Vision-Language Foundation Models by Alignment Perturbation

Published: 2025, Last Modified: 07 Jan 2026 · IEEE Trans. Inf. Forensics Secur. 2025 · CC BY-SA 4.0
Abstract: While Vision-Language Models (VLMs) built on large-scale models have achieved revolutionary advances across various vision-language tasks, research on improving VLM robustness remains underexplored. Existing studies primarily attack VLMs through their pretrained visual or textual encoders and typically require conspicuous noise or long inference times. In this study, we look into the VLM architecture and highlight the alignment module's role as a protective filter that enhances VLM robustness against various perturbations. Motivated by these insights, we investigate VLMs from both user and model-developer perspectives and introduce an alignment perturbation strategy consisting of multimodal, visual, and textual perturbations. The multimodal perturbation achieves targeted textual output generation and is further utilized to enhance VLM robustness. Minimal perturbations to visual or textual inputs can cause significant changes in the overall output of VLMs, revealing their sensitivity to variations in both modalities. Building on the alignment perturbation strategy, we propose alignment robust training, which efficiently improves VLM robustness by finetuning only the parameters of the alignment module, avoiding excessive resource consumption. Experimental results across various tasks and models demonstrate the effectiveness of the proposed alignment perturbation and alignment robust training. These methods deepen the understanding of VLM robustness, enabling secure and reliable deployment in diverse real-world scenarios. Code is available at https://github.com/zhangconghhh/RobustVLMs
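To make the idea of "finetuning only the alignment module under perturbation" concrete, the sketch below illustrates one plausible setup: the vision encoder and language model are frozen, only the alignment (vision-to-language projection) layer is updated, and the aligned features are perturbed during training. The module names, the use of Gaussian noise as the perturbation, and all hyperparameters are illustrative assumptions rather than the authors' exact method, which is detailed in the paper and repository.

```python
# Hedged sketch of alignment robust training (assumed details, not the paper's exact recipe).
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=4096):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)   # stand-in for a frozen visual encoder
        self.alignment = nn.Linear(vis_dim, txt_dim)         # trainable alignment module
        self.language_model = nn.Linear(txt_dim, txt_dim)    # stand-in for a frozen language model

    def forward(self, image_feats, perturb_std=0.0):
        v = self.vision_encoder(image_feats)
        a = self.alignment(v)
        if perturb_std > 0:                                   # perturb the aligned features
            a = a + perturb_std * torch.randn_like(a)
        return self.language_model(a)

model = ToyVLM()

# Freeze everything except the alignment module.
for name, p in model.named_parameters():
    p.requires_grad = "alignment" in name

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# One illustrative training step on random tensors (placeholder data and loss).
image_feats = torch.randn(8, 768)
target = torch.randn(8, 4096)
output = model(image_feats, perturb_std=0.1)
loss = nn.functional.mse_loss(output, target)
loss.backward()
optimizer.step()
```

Because only the alignment layer receives gradient updates, this kind of training touches a small fraction of the model's parameters, which is consistent with the abstract's claim of improving robustness without excessive resource consumption.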