PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
TL;DR: We propose PANDAS, a hybrid technique designed to improve the jailbreaking success rate of LLMs in long-context scenarios.
Abstract: Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question–answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. We also introduce ManyHarm, a dataset of harmful question–answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
Lay Summary: Large language models can be tricked into generating harmful output by overloading them with long, fake conversations. These conversations are designed to make it seem like the model has already followed dangerous instructions many times.
In this paper, we introduce PANDAS, a technique that improves this type of attack by modifying the fake conversations with positive affirmation phrases, negative demonstrations, and a more targeted selection of content.
Results on state-of-the-art open-source models show that PANDAS is more effective at eliciting harmful outputs than previous methods. We also analyze the models' intermediate outputs to understand the effect of PANDAS.
Link To Code: https://github.com/averyma/pandas
Primary Area: Deep Learning->Large Language Models
Keywords: jailbreaking, long-context LLMs, in-context learning
Submission Number: 13745