Generalizable Adversarial Stylization via GRPO-Guided Hierarchical Optimization for Cross-Model Multimodal Jailbreaks
Abstract: The rapid evolution of Multimodal Large Language Models (MLLMs) has exposed persistent vulnerabilities in their safety alignment, particularly under stylistic perturbations that preserve underlying semantics. Extending prior findings on Stylistic Inconsistency, we propose Generalizable Adversarial Stylization (GAS), a GRPO-driven optimization framework designed to discover universal, transferable stylistic triggers. GAS fine-tunes an adversarial style generator that applies subtle, non-semantic stylistic modifications, guided by a Hybrid Reward Signal that integrates refusal-logit feedback with semantic-coherence evaluation from a judge network. Unlike content-dependent jailbreaks, GAS identifies globally effective style vectors that remain potent across architectures and tasks. Extensive experiments show that GAS achieves significantly higher Attack Success Rates (ASR) on both closed-source and open-source MLLMs, establishing stylistic manipulation as a scalable, cross-model strategy for probing and red-teaming multimodal safety mechanisms.
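To make the abstract's Hybrid Reward Signal and GRPO update concrete, here is a minimal sketch. It is illustrative only: the function names (`hybrid_reward`, `grpo_advantages`), the linear trade-off weight `lam`, and all numeric values are assumptions rather than the paper's implementation. The sketch assumes the reward combines a negated refusal logit (attack succeeding) with a judge-network coherence score (semantics preserved), and that GRPO computes advantages by normalizing rewards within a group of rollouts sampled from the same prompt.

```python
import torch

def hybrid_reward(refusal_logit: float, coherence: float, lam: float = 0.5) -> float:
    """Hypothetical hybrid reward: a lower refusal logit on the target MLLM
    and a higher judge-network coherence score both increase the reward.
    `lam` (an assumed hyperparameter) trades off the two terms."""
    return (1.0 - lam) * (-refusal_logit) + lam * coherence

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style group-relative advantages: normalize rewards within one
    group of rollouts from the same prompt, avoiding a learned critic."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Toy group of G = 4 stylized variants of one query; the
# (refusal_logit, coherence) pairs are made-up numbers for illustration.
rewards = [hybrid_reward(rl, c)
           for rl, c in [(2.1, 0.90), (-0.3, 0.95), (1.0, 0.70), (-1.2, 0.85)]]
adv = grpo_advantages(rewards)

# Policy-gradient surrogate: reinforce the style generator's per-rollout
# log-probabilities (one summed log-prob per rollout) by their advantage.
log_probs = torch.randn(4, requires_grad=True)  # stand-in for real values
loss = -(adv.detach() * log_probs).mean()
loss.backward()
```

One point the sketch illustrates: because GRPO normalizes rewards within a group rather than fitting a value network, it pairs naturally with opaque, per-sample reward signals such as refusal logits and judge scores, which is presumably why a GRPO-style update suits this setting.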