Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models
Abstract: The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards—alignment tuning, system prompts, and content moderation.
Yet the real-world robustness of these defenses against adversarial attack remains underexplored.
We introduce MFA, a framework that systematically uncovers general safety vulnerabilities in leading defense-equipped VLMs, including GPT-4o, Gemini-Pro, and Llama 4.1, etc. Central to MFA is the Attention-Transfer Attack (ATA), which conceals harmful instructions inside a meta task with competing objectives. We offer a theoretical perspective grounded in reward-hacking to explain why such an attack can succeed. To maximize cross-model transfer, we introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly evades both input- and output-level filters—without any model-specific fine-tuning.
We empirically show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability.
Combined, our techniques reach a 58.5% overall success rate, consistently outperforming existing methods.
Notably, on state-of-the-art commercial models, our attack achieves a 52.8% success rate, outperforming the second-best attack by 34%. These findings challenge the perceived robustness of current defensive mechanisms, systematically expose general safety loopholes within defense-equipped VLMs, and offer a practical probe for diagnosing and strengthening the safety of VLMs.
Loading