Abstract: This paper focuses on jailbreaking attacks against Large Vision-Language Models (LVLMs), which aim to induce offensive responses to harmful queries. Previous studies have demonstrated the effectiveness of various attack strategies, including textual, visual, and bi-modal jailbreaking prompts. However, relying on a single strategy often yields suboptimal success rates and response quality across diverse queries. Moreover, successful attacks on LVLMs often require numerous requests due to the inherent limitations of existing methods. To address these challenges, we propose Plan4Attack, an agent-based framework powered by a Large Language Model that dynamically selects the optimal attack strategy to improve attack efficiency. Specifically, we first equip the agent with multi-strategy capabilities through instruction tuning and then integrate the jailbreaking attack into a reinforcement learning process, which allows the agent to generate optimal jailbreaking prompts based on the compatibility between queries and strategies. Subsequently, we design multi-dimensional rewards, covering prompt stealthiness, response relevance, and trigger rate, to deepen the agent's understanding of the compatibility among queries, attack strategies, and LVLM security mechanisms. Experiments on various open-source LVLMs show that Plan4Attack boosts the Attack Success Rate (ASR) by 6.59%–17.32% and improves the Helpfulness Rate (HFR) by 16.34%–23.76%. Furthermore, our framework demonstrates strong transferability to black-box commercial LVLMs, a high degree of automation, and lower request overhead. The code will be released. Warning: This paper contains example data that may be offensive or harmful.
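As a rough illustration of the multi-dimensional reward mentioned in the abstract, the sketch below shows one plausible way to aggregate the three reward dimensions (prompt stealthiness, response relevance, trigger rate) into a single scalar for the reinforcement learning step. The weights, function names, and example values are assumptions for illustration only, not the paper's released implementation.

```python
# Hypothetical sketch of aggregating the multi-dimensional rewards described
# in the abstract. All names, weights, and example scores are illustrative
# assumptions, not Plan4Attack's actual implementation.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    stealth: float = 0.3    # how inconspicuous the jailbreaking prompt is
    relevance: float = 0.4  # how well the LVLM response addresses the query
    trigger: float = 0.3    # whether the LVLM's safety mechanism was bypassed


def combined_reward(stealth_score: float,
                    relevance_score: float,
                    triggered: bool,
                    w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of the three reward dimensions (each assumed in [0, 1]),
    used as the scalar signal for updating the strategy-selection agent."""
    trigger_score = 1.0 if triggered else 0.0
    return (w.stealth * stealth_score
            + w.relevance * relevance_score
            + w.trigger * trigger_score)


# Example: a fairly stealthy prompt, a relevant response, and a bypassed filter.
print(combined_reward(stealth_score=0.9, relevance_score=0.8, triggered=True))
```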
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Safety and Alignment in LLMs
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Keywords: Safety and Alignment in LLMs
Submission Number: 7248