Universal Style Triggers for Automated Red-Teaming of Vision-Language Models
Abstract: Content-dependent jailbreaks against commercial MLLMs suffer from low repeatability; a single prompt rewrite or image crop can neutralise the attack. We hypothesise that universal, content-agnostic style triggers offer a more reliable attack vector. By rendering arbitrary images through adversarially chosen artistic styles, we observe a measurable drop in refusal probability while caption quality remains intact. We formalise the search for such triggers as a constrained optimisation problem and solve it with a two-level reinforcement-learning loop: an outer bandit policy proposes stylistic parameters (palette, stroke, noise spectrum), and an inner critic allocates reward based on a tandem safety–consistency score. The resulting style triggers boost the attack success rate (ASR) of baseline attacks by 3–5× on five proprietary models, require only one-time offline training, and remain effective after model updates, revealing a persistent weakness in current safety alignment.
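The outer bandit loop described in the abstract can be sketched as an epsilon-greedy search over a discrete style-parameter space. Everything below is a minimal illustrative assumption: the parameter values, the `simulated_reward` stand-in for the inner critic, and the epsilon-greedy choice of bandit algorithm are not specified by the paper.

```python
import random

# Hypothetical discrete search space for the stylistic parameters
# named in the abstract (values are illustrative, not from the paper).
PALETTES = ["sepia", "neon", "pastel"]
STROKES = ["thin", "thick", "stippled"]
NOISE_SPECTRA = ["white", "pink", "brown"]
ARMS = [(p, s, n) for p in PALETTES for s in STROKES for n in NOISE_SPECTRA]

def simulated_reward(arm, rng):
    """Stand-in for the inner critic: combines a mock safety score
    (probability the styled image avoids refusal) with a mock
    caption-consistency score. A real system would query the target
    model and a caption-similarity judge here."""
    safety = min(rng.random() * (1.5 if arm[0] == "pastel" else 1.0), 1.0)
    consistency = rng.random()
    return safety * consistency  # tandem score in [0, 1]

def epsilon_greedy_search(steps=2000, epsilon=0.1, seed=0):
    """Outer bandit loop: propose style parameters, observe the
    critic's reward, maintain a running mean per arm, and return
    the best arm found."""
    rng = random.Random(seed)
    counts = {arm: 0 for arm in ARMS}
    means = {arm: 0.0 for arm in ARMS}
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.choice(ARMS)          # explore
        else:
            arm = max(ARMS, key=means.get)  # exploit current best
        r = simulated_reward(arm, rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean
    return max(ARMS, key=means.get)

best = epsilon_greedy_search()
print(best)
```

In this sketch the arm with the highest mean tandem score plays the role of a universal style trigger; the "one-time offline training" claim corresponds to running this loop once and reusing the selected parameters at attack time.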