Abstract: Text-to-Image (T2I) models typically deploy safety mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attacks manually design instructions that guide an LLM to generate adversarial prompts, which effectively expose safety vulnerabilities of T2I models. However, existing methods have two limitations: 1) they rely on exhaustive, manually crafted strategies for designing adversarial prompts and lack a unified framework, and 2) they require numerous queries to achieve a successful attack, limiting their practical applicability. To address these issues, we propose Reason2Attack (R2A), which aims to enhance the effectiveness and efficiency of the LLM in jailbreaking attacks. Specifically, we first use Frame Semantics theory to systematize existing manually crafted strategies and propose a unified generation framework that produces adversarial prompts step by step through chain-of-thought (CoT) reasoning.
Following this, we propose a two-stage LLM reasoning training framework guided by the attack process. In the first stage, the LLM is fine-tuned on CoT examples produced by the unified generation framework to internalize the adversarial prompt generation process grounded in Frame Semantics. In the second stage, we incorporate the jailbreaking task into the LLM's reinforcement learning process, guided by the proposed attack-process reward function that balances prompt stealthiness, effectiveness, and length, enabling the LLM to understand T2I models and their safety mechanisms. Extensive experiments on various T2I models equipped with safety mechanisms, as well as on commercial T2I models, demonstrate the superiority and practicality of R2A.
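
To make the balance among the three reward terms concrete, a minimal illustrative sketch of such an attack-process reward (not the paper's actual formulation; the indicator terms, length budget $L_{\max}$, and weights $\lambda_{s}$, $\lambda_{\ell}$ are assumptions for exposition) could take the form

\[
R(p) \;=\; \underbrace{\mathbb{1}\big[\text{the T2I model renders the intended sensitive content from } p\big]}_{\text{effectiveness}}
\;+\; \lambda_{s}\,\underbrace{\mathbb{1}\big[p \text{ passes the safety mechanism}\big]}_{\text{stealthiness}}
\;-\; \lambda_{\ell}\,\max\big(0,\; |p| - L_{\max}\big),
\]

where $|p|$ denotes the token length of the adversarial prompt $p$, so that the reward favors prompts that evade the safety filter, still induce the target image, and stay within a length budget.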