Abstract: This paper focuses on jailbreaking attacks against Large Vision-Language Models (LVLMs), which aim to induce offensive responses to harmful queries. Previous studies have demonstrated the effectiveness of various attack strategies, including textual, visual, and bi-modal jailbreaking prompts. However, relying on a single strategy often yields suboptimal success rates and response quality across diverse queries. Moreover, successful attacks on LVLMs often require numerous requests due to the inherent limitations of existing methods. To address these challenges, we propose Plan4Attack, an agent-based framework powered by a Large Language Model that dynamically selects the optimal attack strategy to improve attack efficiency. Specifically, we first equip the agent with multi-strategy capabilities through instruction tuning and then integrate the jailbreaking attack into a reinforcement learning process, which allows the agent to generate optimal jailbreaking prompts based on the compatibility between queries and strategies. Subsequently, we design multi-dimensional rewards, covering prompt stealthiness, response relevance, and trigger rate, to deepen the agent's understanding of the compatibility among queries, attack strategies, and LVLM security mechanisms. Experiments on various open-source LVLMs show that Plan4Attack boosts the Attack Success Rate (ASR) by 6.59%–17.32% and improves the Helpfulness Rate (HFR) by 16.34%–23.76%. Furthermore, our framework demonstrates strong transferability to black-box commercial LVLMs, a high degree of automation, and lower request overhead. The code will be released. Warning: This paper contains example data that may be offensive or harmful.
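As a rough illustration of the multi-dimensional reward mentioned in the abstract, the sketch below shows one plausible way to aggregate the three reward dimensions (prompt stealthiness, response relevance, trigger rate) into a single scalar for the reinforcement learning step. The weights, function names, and example values are assumptions for illustration only, not the paper's released implementation.

```python
# Hypothetical sketch of aggregating the multi-dimensional rewards described
# in the abstract. All names, weights, and example scores are illustrative
# assumptions, not Plan4Attack's actual implementation.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    stealth: float = 0.3    # how inconspicuous the jailbreaking prompt is
    relevance: float = 0.4  # how well the LVLM response addresses the query
    trigger: float = 0.3    # whether the LVLM's safety mechanism was bypassed


def combined_reward(stealth_score: float,
                    relevance_score: float,
                    triggered: bool,
                    w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of the three reward dimensions (each assumed in [0, 1]),
    used as the scalar signal for updating the strategy-selection agent."""
    trigger_score = 1.0 if triggered else 0.0
    return (w.stealth * stealth_score
            + w.relevance * relevance_score
            + w.trigger * trigger_score)


# Example: a fairly stealthy prompt, a relevant response, and a bypassed filter.
print(combined_reward(stealth_score=0.9, relevance_score=0.8, triggered=True))
```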
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Safety and Alignment in LLMs
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Keywords: Safety and Alignment in LLMs
Submission Number: 7248