SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
Keywords: Jailbreak Attacks, LLM Vulnerabilities, Automated Prompt Formatting, Attack Success Rate
TL;DR: This paper introduces SequentialBreak, a single-query jailbreak method that embeds malicious instructions within prompt chains to exploit attention patterns in large language models and bypass safety mechanisms.
Abstract: As the use of Large Language Models (LLMs) expands, so do concerns about their vulnerability to jailbreak attacks. We introduce SequentialBreak, a novel single-query jailbreak technique that arranges multiple benign prompts in sequence with a hidden malicious instruction among them to bypass safety mechanisms. Sequential prompt chains in a single query can lead LLMs to focus on certain prompts while ignoring others. By embedding a malicious prompt within a prompt chain, we show that LLMs tend to overlook the harmful context and respond to all prompts, including the harmful one. We demonstrate the effectiveness of our attack across diverse scenarios—including Q&A systems, dialogue completion tasks, and level-wise gaming scenarios—highlighting its adaptability to varied prompt structures. This variability in prompt structure shows that SequentialBreak is adaptable to formats beyond those discussed here. Experiments show that SequentialBreak, using only a single query, significantly outperforms existing baselines on both open-source and closed-source models. These findings underline the urgent need for more robust defenses against prompt-based attacks. The results and website are available at https://anonymous.4open.science/r/JailBreakAttack-4F3B/.
Archival Status: Archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 114