Jailbreaking in the Haystack

ICLR 2026 Conference Submission 15865 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: long-context language models, long-context attack, jailbreak attacks, alignment robustness, adversarial prompting, model safety, prompt injection, contextual vulnerabilities, LLM evaluation
TL;DR: We present NINJA, a prompt-based jailbreak attack on long-context LLMs that hides harmful goals within benign, model-generated input. Simply extending the context length can degrade alignment, enabling stealthy, compute-efficient attacks.
Abstract: Recent advances in long-context language models (LMs) have enabled million-token inputs, expanding their capabilities to complex applications such as computer-use agents. Yet the safety implications of these extended contexts remain unclear. To bridge this gap, we introduce NINJA (short for Needle-in-haystack jailbreak attack), a method that jailbreaks aligned LMs by appending benign, model-generated content to harmful user goals, leveraging the critical observation that the position of the harmful goal plays a significant role in safety. Experiments show that NINJA significantly increases attack success rates across multiple small-to-mid-sized models, including LLaMA-3, Qwen-2.5, and Gemini Flash, achieving strong performance on HarmBench; we further validate the positional effect in the BrowserART web-browsing agent framework. Unlike prior jailbreaking methods, our approach is low-resource, transferable, and less detectable. Moreover, we show that NINJA is compute-efficient: under a fixed compute budget, increasing the context length can outperform increasing the number of trials in best-of-N jailbreaking. These findings reveal that even benign long contexts, when crafted with careful goal positioning, introduce fundamental vulnerabilities in modern LMs.
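To make the prompt-construction idea in the abstract concrete, below is a minimal Python sketch of embedding a goal at a chosen position inside long, benign filler text. The function name build_ninja_prompt, the depth parameter, and the placeholder filler are illustrative assumptions for exposition, not the authors' actual pipeline; in the paper the filler is benign, model-generated content.

def build_ninja_prompt(goal: str, benign_passages: list[str], depth: float = 1.0) -> str:
    """Insert `goal` into a long benign context.

    depth = 0.0 places the goal before all filler; 1.0 places it at the very end.
    The abstract reports that this positioning strongly affects whether the
    aligned model refuses. (Hypothetical helper for illustration only.)
    """
    assert 0.0 <= depth <= 1.0
    cut = int(round(depth * len(benign_passages)))
    parts = benign_passages[:cut] + [goal] + benign_passages[cut:]
    return "\n\n".join(parts)

if __name__ == "__main__":
    # Placeholder filler repeated to reach a target context length; the real attack
    # would use benign text generated by the target or another model.
    filler = ["Paragraph about the history of tea cultivation ..."] * 200
    prompt = build_ninja_prompt("<harmful goal from HarmBench>", filler, depth=1.0)
    print(len(prompt.split()), "words in the assembled prompt")

Sweeping depth and the amount of filler in such a harness is one plausible way to study the positional and context-length effects, and the fixed-compute comparison against best-of-N trials, that the abstract describes.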
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15865