Decoupled Style Controller: A Plug-and-Play Pathway for Elevating ASR via GRPO-Selected Stylistic Triggers in Vision–Language Models
Abstract: Safety-aligned Multimodal Large Language Models exhibit a persistent gap between semantic comprehension and safety behavior: identical content is understood consistently, yet safety filters can be bypassed by stylistic triggers. We operationalize this asymmetry with a plug-and-play Decoupled Style Controller composed of a frozen style encoder, which embeds style into a continuous manifold, and a lightweight policy controller, which selects and composes style codes. Rather than fine-tuning an image editor, the controller wraps arbitrary adversarial inputs in style codes without altering task semantics. Policy learning employs Group Relative Policy Optimization (GRPO) under a multi-level reward function: a logit-level refusal detector enforces low-tier safety, and a high-fidelity judge model enforces semantic preservation at upper tiers. This decoupling enables stable search over non-intuitive style combinations while keeping inference overhead minimal. Evaluations across commercial multimodal LLMs, including GPT-4V, Claude, and Gemini, demonstrate that the module amplifies the Attack Success Rate (ASR) of diverse base attacks, requires no access to model internals, and transfers across prompts and safety updates.