Break Yourself While Telling Me a Story: An Implicit Jailbreak Attack on LLMs via Open-ended Generation Traps

ACL ARR 2026 January Submission 553 Authors

23 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLMs, AI safety, Jailbreaking
Abstract: Large Language Models (LLMs) exhibit strong open-ended generation capabilities, enabling them to produce coherent and contextually appropriate text across a wide range of tasks. However, this flexibility also exposes them to jailbreak attacks, in which adversaries manipulate the model into producing malicious or policy-violating content. Existing jailbreak approaches typically rely on prompts containing explicit harmful requests, which are easily detected and filtered by the model's safety alignment mechanisms. In this work, we introduce OGT, a novel implicit jailbreak attack that operates without any explicit harmful request. Instead, OGT constructs an open-ended generation trap through a contextualized narrative scene that induces the model to actively generate harmful content. Driven by its internal objective of maintaining narrative coherence and role consistency, the model gradually reveals malicious details as part of the story, effectively "breaking itself while telling a story." We evaluate OGT on state-of-the-art models including GPT-4o, Gemini 2.5 Pro, and DeepSeek-R1. OGT achieves a 99.5%–100% attack success rate and a malicious response rate of up to 99.6%, significantly outperforming existing jailbreak methods.
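The abstract reports two evaluation metrics, attack success rate (ASR) and malicious response rate, without defining them. As a minimal sketch (not from the paper), assuming a judge model that labels each response as jailbroken and counts how many of its segments are malicious, the two rates could be computed as follows; the `Judgment` fields and both function names are hypothetical:

```python
# Hypothetical sketch of the two reported metrics, assuming per-response
# judge labels; not the paper's actual evaluation code.
from dataclasses import dataclass


@dataclass
class Judgment:
    jailbroken: bool      # did the response comply with the hidden harmful goal?
    malicious_spans: int  # number of response segments judged malicious
    total_spans: int      # total number of segments in the response


def attack_success_rate(judgments: list[Judgment]) -> float:
    """Fraction of prompts whose response was judged jailbroken."""
    return sum(j.jailbroken for j in judgments) / len(judgments)


def malicious_response_rate(judgments: list[Judgment]) -> float:
    """Per-response fraction of malicious segments, averaged over prompts."""
    rates = [j.malicious_spans / j.total_spans for j in judgments if j.total_spans]
    return sum(rates) / len(rates)


# Illustrative numbers only: 199 of 200 prompts jailbroken -> ASR = 99.5%.
judgments = [Judgment(True, 9, 10)] * 199 + [Judgment(False, 0, 10)]
print(f"ASR: {attack_success_rate(judgments):.1%}")
print(f"Malicious response rate: {malicious_response_rate(judgments):.1%}")
```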
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 553