A Systematic Benchmark for Evaluating Stylistic Vulnerabilities and Non-Content Jailbreak Attacks on MLLMs

Published: 31 Aug 2025, Last Modified: 06 Nov 2025. OpenReview Archive Direct Upload. License: CC BY 4.0
Abstract: Multimodal Large Language Models (MLLMs) demonstrate robust comprehension capabilities yet remain vulnerable to jailbreak attacks that exploit non-content-based weaknesses. Recent findings reveal that MLLMs exhibit Stylistic Inconsistency: their defense mechanisms can be bypassed by specific visual styles even though their comprehension ability is preserved. However, current evaluation frameworks lack systematic approaches for quantifying these stylistic vulnerabilities across diverse MLLMs. We introduce StyleJailbreak-Bench, a comprehensive benchmark specifically designed to evaluate stylistic triggers and their effectiveness in bypassing safety alignment. Our benchmark comprises a taxonomy of visual style categories, each paired with harmful content, to measure attack success rates (ASR) under various stylistic modifications. We employ Group Relative Policy Optimization (GRPO) principles to automatically discover optimal stylistic parameters, using a Fused Reward Function that combines logit-based refusal signals with high-fidelity semantic evaluations from judge models. Extensive experiments on commercial closed-source MLLMs and open-source models demonstrate that stylistic biases constitute a critical, scalable vulnerability vector. We systematically analyze the Stylistic Inconsistency phenomenon, showing that while comprehension remains robust across visual styles, safety performance degrades significantly. Our benchmark provides a plug-and-play evaluation module for red-teaming MLLMs against non-content-based attacks, enabling researchers to quantify stylistic vulnerabilities and develop more robust safety alignment strategies. The findings underscore that addressing content-based jailbreaks alone is insufficient: comprehensive safety mechanisms must account for stylistic triggers as a modular and persistent threat to MLLM security.
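The abstract's Fused Reward Function, which combines a logit-based refusal signal with a judge-model semantic score, could be sketched as follows. This is a minimal illustration, not the authors' implementation: the mixing weight `alpha`, the helper names, and the exact mapping from refusal log-probability to a bypass signal are all assumptions.

```python
import math

def fused_reward(refusal_logprob: float, judge_score: float,
                 alpha: float = 0.5) -> float:
    """Combine two attack-success signals into one scalar reward (a sketch).

    refusal_logprob: log-probability the target MLLM assigns to a refusal
        response (lower probability means the refusal was bypassed).
    judge_score: semantic harmfulness score from a judge model in [0, 1]
        (higher means the response actually complied with the request).
    alpha: mixing weight between the two signals (an assumed hyperparameter).
    """
    # Map the refusal log-probability to a "bypass" signal in [0, 1]:
    # a high refusal probability yields a low reward.
    bypass_signal = 1.0 - math.exp(refusal_logprob)
    return alpha * bypass_signal + (1.0 - alpha) * judge_score
```

In a GRPO-style loop, this scalar would score each sampled stylistic modification within a group, and advantages would be computed relative to the group mean, steering the search toward styles that both suppress refusals and elicit genuinely harmful completions.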