PUZZLED: Jailbreaking LLMs through Word-Based Puzzles

ACL ARR 2026 January Submission 7796 Authors

06 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: Large Language Models, Jailbreak, AI Safety, Alignment, Prompt Engineering
Abstract: As LLM deployment expands, jailbreak attacks pose growing safety concerns. We propose PUZZLED, a novel jailbreak method that exploits LLMs' reasoning ability by masking harmful keywords as word puzzles. Keywords in a harmful instruction are hidden as word search, anagram, or crossword puzzles, so the model must first solve the puzzles to reconstruct the instruction before responding. Evaluations on five state-of-the-art LLMs show a high average attack success rate of 88.8%, including 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet, demonstrating that PUZZLED is a simple yet effective reasoning-based jailbreak strategy.
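The keyword-masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper name `mask_as_anagram`, the `[WORD1]` placeholder, and the benign example instruction are all illustrative assumptions, showing only the anagram variant of the puzzle construction.

```python
import random


def mask_as_anagram(instruction: str, keyword: str, seed: int = 0) -> tuple[str, str]:
    """Illustrative sketch (not the paper's code): replace `keyword` in
    `instruction` with a placeholder and return the masked instruction
    plus an anagram clue the model would have to solve."""
    letters = list(keyword)
    rng = random.Random(seed)  # seeded for reproducibility
    scrambled = keyword
    while scrambled == keyword:  # make sure the anagram differs from the original
        rng.shuffle(letters)
        scrambled = "".join(letters)
    masked = instruction.replace(keyword, "[WORD1]")
    clue = f"[WORD1] is an anagram of '{scrambled}'"
    return masked, clue


# Benign illustrative example
masked, clue = mask_as_anagram("share a cookie recipe", "recipe")
print(masked)  # share a cookie [WORD1]
print(clue)
```

Under this sketch, the masked instruction and the clue would be combined into a single prompt, so the model reconstructs the original keyword only by solving the puzzle; the word-search and crossword variants would differ only in how the clue is encoded.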
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: adversarial attacks/examples/training, robustness, safety and alignment, red teaming
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7796