Stylistic Perturbation Amplification: Reinforcement-Driven Visual Refinement for Scalable Jailbreaks in Multimodal LLMs
Abstract: Safety filters of contemporary Multimodal Large Language Models (MLLMs) are largely tuned on canonical image statistics, leaving them brittle under distribution shift. We reveal that such shifts can be engineered into imperceptible stylistic perturbations that preserve human-perceived semantics yet systematically suppress refusal signals. Leveraging this observation, we introduce SPA, a lightweight plug-in that wraps any existing adversarial image with a learned visual refinement. SPA trains a diffusion-based editor through a constrained policy-search routine that maximises a hierarchical reward: a fast logit-based refusal detector and a slower, judge-model semantic score. The agent automatically discovers minute colour–texture combinations that raise Attack Success Rate (ASR) from 18% to 87% on GPT-4V and Gemini-Pro without additional queries to the target model. Extensive ablations show the perturbations transfer across prompts, architectures, and safety patches, confirming stylistic fragility as a dependable red-team primitive.