Jailbreaking in the Haystack

Published: 10 Jun 2025 · Last Modified: 24 Jun 2025 · LCFM 2025 · CC BY 4.0
Keywords: long-context language models, long-context attack, jailbreak attacks, alignment robustness, adversarial prompting, model safety, prompt injection, contextual vulnerabilities, LLM evaluation
TL;DR: We present Ninja, a prompt-based jailbreak attack on long-context LLMs that hides harmful goals in benign input. Simply extending context length can degrade alignment, enabling stealthy, compute-efficient attacks.
Abstract: Recent advances in long-context large language models (LLMs) have enabled million-token inputs, expanding their capabilities across complex tasks like computer-use agents. Yet the safety implications of these extended contexts remain unclear. To bridge this gap, we introduce Ninja (short for $N$eedle-$in$-haystack $j$ailbreak $a$ttack), a method that jailbreaks aligned LLMs by appending benign, model-generated content to harmful user goals. Critical to our method is the observation that the position of the harmful goal within the context plays an important role in safety. Experiments on the standard safety benchmark HarmBench show that Ninja significantly increases attack success rates across state-of-the-art open and proprietary models, including LLaMA, Qwen, and Gemini. Unlike prior jailbreaking methods, our approach is low-resource, transferable, and less detectable. Moreover, we show that Ninja is compute-optimal: under a fixed compute budget, increasing context length can outperform increasing the number of trials in best-of-$N$ jailbreaks. These findings reveal that even benign long contexts, when crafted with careful goal positioning, introduce fundamental vulnerabilities in modern LLMs.
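The abstract describes the attack only at a high level: benign, model-generated filler is assembled around a harmful goal, and the goal's position within the long context matters. The sketch below illustrates just that prompt-assembly step under those assumptions; the function name, the `position` parameter, and the placeholder filler are hypothetical illustrations, not the authors' implementation or data.

```python
# Minimal sketch of needle-in-haystack style prompt assembly, based only on the
# abstract's description. All names and the filler text here are assumptions.

def build_long_context_prompt(goal: str,
                              benign_passages: list[str],
                              position: float = 0.5) -> str:
    """Embed `goal` at a relative `position` (0.0 = start, 1.0 = end)
    inside a long context built from otherwise benign passages."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie in [0, 1]")
    split = int(len(benign_passages) * position)
    before = "\n\n".join(benign_passages[:split])
    after = "\n\n".join(benign_passages[split:])
    return "\n\n".join(part for part in (before, goal, after) if part)


if __name__ == "__main__":
    # Placeholder filler; the paper instead uses benign, model-generated content.
    filler = [f"Benign passage {i} about an unrelated topic." for i in range(1000)]
    prompt = build_long_context_prompt("<user goal>", filler, position=0.75)
    print(len(prompt.split()), "words; goal placed roughly 75% into the context")
```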
Submission Number: 25