Many-shot Jailbreaking

Published: 25 Sept 2024; Last Modified: 06 Nov 2024. NeurIPS 2024 poster. License: CC BY-NC-ND 4.0
Keywords: large language models, long context, robustness, jailbreaks, in-context learning
TL;DR: We investigate a simple yet effective long-context jailbreak, study the corresponding scaling laws and evaluate some mitigations against it.
Abstract: We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This attack is newly feasible with the larger context windows recently deployed by language model providers like Google DeepMind, OpenAI and Anthropic. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.
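As a rough illustration of the two technical points in the abstract, the prompt built from many demonstrations and the power-law trend in its effectiveness, here is a minimal Python sketch. It is not the authors' code: the demonstration strings, the `build_many_shot_prompt` helper, the choice of metric, and the synthetic numbers are all placeholders introduced for illustration.

```python
"""Illustrative sketch only (not the paper's code or data).

Part 1 shows how a many-shot jailbreak prompt is assembled in principle:
many faux user/assistant demonstration pairs are concatenated ahead of the
target request. Part 2 shows how a power law between the number of shots
and an effectiveness metric can be fit by least squares in log-log space.
"""

import numpy as np

# --- 1. Assembling a many-shot prompt --------------------------------------
# Placeholder demonstration pairs; the real attack uses hundreds of
# demonstrations of the undesirable behavior.
demonstrations = [
    ("[harmful question 1]", "[compliant answer 1]"),
    ("[harmful question 2]", "[compliant answer 2]"),
    # ... hundreds more shots in the actual attack ...
]

def build_many_shot_prompt(demos, target_question):
    """Concatenate faux dialogues, then append the target request."""
    shots = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in demos)
    return f"{shots}\n\nUser: {target_question}\nAssistant:"

prompt = build_many_shot_prompt(demonstrations, "[target harmful question]")

# --- 2. Fitting a power law -------------------------------------------------
# If an effectiveness metric m(n) (for example, the negative log-likelihood
# of a harmful response) scales as m(n) = C * n**(-alpha) in the number of
# shots n, then log m is linear in log n, so alpha and C can be recovered by
# ordinary least squares. The data below is synthetic.
rng = np.random.default_rng(0)
n_shots = np.array([4, 8, 16, 32, 64, 128, 256])
metric = 3.0 * n_shots**-0.6 * np.exp(rng.normal(0.0, 0.05, n_shots.size))

slope, intercept = np.polyfit(np.log(n_shots), np.log(metric), 1)
alpha, c = -slope, np.exp(intercept)
print(f"fitted power law: metric ~ {c:.2f} * n^(-{alpha:.2f})")
```

Because the fit is linear in log-log space, the power-law exponent is simply the (negated) slope of that line; this is the sense in which attack effectiveness can be characterized as following a power law in the number of shots.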
Primary Area: Safety in machine learning
Submission Number: 8432