Keywords: Vision-Language Models, Malicious Prompts, Defense, Efficiency, Robustness
Abstract: Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to alignment weakened during visual integration. Existing defenses suffer from limitations in efficiency and robustness. To address these challenges, we first propose the **M**ultimodal **A**ggregated **F**eature **E**xtraction (**MAFE**) framework, which enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of **MAFE**-extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building upon this finding, we develop **VLMShield**, a lightweight, plug-and-play safety detector that efficiently identifies multimodal malicious attacks. Extensive experiments demonstrate that VLMShield achieves superior performance across multiple dimensions, including robustness, efficiency, and utility. Through our work, we hope to pave the way for more secure multimodal AI deployment.
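As a rough illustration of the pipeline the abstract describes, the sketch below fuses CLIP image and text features into a single vector and feeds it to a small classifier head. This is a minimal sketch under stated assumptions: the chunk-and-mean-pool handling of long text and the MLP detector head are placeholders for illustration only, since the abstract does not specify MAFE's aggregation mechanism or VLMShield's architecture.

```python
# Illustrative sketch only -- MAFE's actual aggregation and VLMShield's
# actual detector are not described in the abstract; both are assumed here.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_fused_features(image: Image.Image, text: str) -> torch.Tensor:
    """Fuse an image and a (possibly long) text prompt into one vector.

    CLIP's text encoder is capped at 77 tokens, so long prompts are split
    into word chunks, encoded separately, and mean-pooled -- an assumed
    stand-in for MAFE's long-text handling.
    """
    with torch.no_grad():
        # Image branch: standard CLIP image embedding, shape (1, 512).
        img_inputs = processor(images=image, return_tensors="pt")
        img_feat = model.get_image_features(**img_inputs)

        # Text branch: encode each chunk within CLIP's token limit.
        words = text.split()
        chunks = [" ".join(words[i:i + 50])
                  for i in range(0, len(words), 50)] or [text]
        txt_feats = []
        for chunk in chunks:
            txt_inputs = processor(text=chunk, return_tensors="pt",
                                   truncation=True, max_length=77)
            txt_feats.append(model.get_text_features(**txt_inputs))
        txt_feat = torch.cat(txt_feats).mean(dim=0, keepdim=True)

    # Unified multimodal representation, shape (1, 1024).
    return torch.cat([img_feat, txt_feat], dim=-1)

# Hypothetical lightweight detector head: a small MLP classifying the
# fused feature as benign (0) or malicious (1). In practice it would be
# trained on labeled benign/malicious prompts.
detector = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 2))

# Usage: logits = detector(extract_fused_features(image, prompt))
```

Because the detector operates only on frozen CLIP features, it can sit in front of any VLM without modifying the model itself, which is consistent with the plug-and-play framing in the abstract.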
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 1340