Keywords: multi-turn jailbreaks, jailbreaking
Abstract: Multi-turn jailbreak attacks are effective against text-only *large language models* (LLMs), gradually introducing malicious content across turns. However, naively adding visual inputs makes existing multi-turn jailbreaks easy for safety-aligned *large vision-language models* (LVLMs) to defend against: for example, an overly malicious visual input readily triggers the defense mechanisms of safety-aligned LVLMs, yielding a more conservative response. To address this, we propose ***MAPA***: a **m**ulti-turn **a**daptive **p**rompting **a**ttack that 1) *at each turn*, alternates between text and vision attack actions to elicit the most malicious response; and 2) *across turns*, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables ***MAPA*** to consistently outperform state-of-the-art methods, improving attack success rates by 5-13% on HarmBench and JailbreakBench against LLaVA-v1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. Our code is available at: https://anonymous.4open.science/r/MAPA-jailbreak.
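To make the abstract's two-level design concrete, here is a minimal Python sketch of the loop it describes. This is an assumption-laden skeleton, not the authors' implementation (see the linked repository for that): `query_lvlm`, `score`, and `refine` are hypothetical stubs, and the turn count, judge, and early-exit threshold are illustrative placeholders.

```python
from typing import List, Optional, Tuple

ACTIONS = ("text", "vision")  # the abstract's per-turn text-vision attack actions

def query_lvlm(prompt: str, image: Optional[str], history: List[str]) -> str:
    """Stub for the target LVLM (e.g., LLaVA-v1.6 or Qwen2.5-VL in the paper)."""
    return f"response to {prompt!r}"

def score(response: str) -> float:
    """Stub judge; a maliciousness score in [0, 1] is assumed here."""
    return 0.0

def refine(goal: str,
           trajectory: List[Tuple[str, Optional[str]]]
           ) -> List[Tuple[str, Optional[str]]]:
    """Stub cross-turn refinement: revise the (prompt, image) plan so that
    maliciousness ramps up gradually instead of being front-loaded."""
    return trajectory or [(goal, None)] * 3  # 3-turn plan chosen arbitrarily

def mapa(goal: str, refinements: int = 3, threshold: float = 0.5) -> List[str]:
    trajectory: List[Tuple[str, Optional[str]]] = []
    responses: List[str] = []
    for _ in range(refinements):            # level 2: across-turn refinement
        trajectory = refine(goal, trajectory)
        responses = []
        for prompt, image in trajectory:    # level 1: within each turn
            # Try a text-side and a vision-side variant of the turn and keep
            # whichever elicits the more malicious reply (selection rule assumed).
            variants = [(prompt, None) if a == "text" else (prompt, image or "img")
                        for a in ACTIONS]
            replies = [query_lvlm(p, im, responses) for p, im in variants]
            responses.append(max(replies, key=score))
        if score(responses[-1]) >= threshold:  # early exit on success (assumption)
            break
    return responses
```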
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18622