Plan4Attack: Adaptive Strategy Planning for Efficient Jailbreaking of Large Vision-Language Models

ACL ARR 2026 May Submission16661 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0
Keywords: AI Safety, Red-teaming
Abstract: Jailbreak attacks expose critical safety vulnerabilities in the alignment mechanisms of Large Vision-Language Models (LVLMs). Existing methods, however, predominantly commit to a single fixed strategy that fails to generalize across heterogeneous queries, yielding unstable attack success rates (ASR) and inconsistent response quality. Worse, they often rely on heavyweight LLMs or diffusion models for multi-round query rewriting, incurring prohibitive cost in realistic red-teaming. We propose Plan4Attack, a dynamic strategy-planning framework that, in contrast to prior LLM-as-rewriter pipelines, repositions the LLM as a strategy-reasoning planner that adaptively selects the most compatible lightweight attack for each query. We cast jailbreaking as a reinforcement learning problem and introduce a multi-dimensional reward that jointly models harmfulness, helpfulness, and jailbreak success probability. This reward steers the LLM planner to select a strategy per query and generate adversarial prompts accordingly, achieving strong attacks at a fraction of prior cost. Across diverse open-source LVLMs, Plan4Attack improves ASR by 6.59%–17.32% and HFR by 16.34%–23.76%, transfers effectively to black-box commercial LVLMs, and substantially reduces query budget and attack latency. Warning: This paper contains example data that may be offensive or harmful.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety, adversarial attacks, red teaming
Contribution Types: NLP engineering experiment
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 16661