Keywords: Large Language Models, Jailbreak, AI Safety, Alignment, Prompt Engineering
Abstract: As LLM deployment expands, jailbreak attacks pose growing safety concerns. We propose PUZZLED, a novel jailbreak method that exploits LLMs' reasoning by masking harmful keywords as word puzzles. The harmful keywords are embedded in word search, anagram, or crossword puzzles, which the model must solve to reconstruct the harmful instruction before responding. Evaluations on five state-of-the-art LLMs show a high average attack success rate of 88.8%, including 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet, demonstrating that PUZZLED is a simple yet effective reasoning-based jailbreak strategy.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: adversarial attacks/examples/training, robustness, safety and alignment, red teaming
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7796