Keywords: Survey, Jailbreak Attacks
Abstract: Large language models (LLMs) face significant safety challenges from jailbreak attacks—techniques that manipulate prompts to bypass defenses and elicit harmful outputs. Existing taxonomies focus on manipulation methods rather than underlying mechanisms, limiting our understanding of attack effectiveness and defensive strategies.
In this work, we survey existing LLM jailbreak attacks and organize them using a novel two-fold taxonomy. Our technical taxonomy categorizes attacks into three tiers based on the vulnerabilities they exploit and the approaches they employ. Our operational taxonomy evaluates attacks along four dimensions to assess their real-world feasibility and sustainability. Through correlation analysis, we reveal relationships between LLM vulnerabilities and practical attack constraints.
Applying our taxonomies to existing attacks identifies research gaps and provides insights for developing stronger offensive and defensive methods. This work supports systematic, risk-informed security improvements for LLMs, helping the research community move beyond reactive defenses.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling
Contribution Types: Surveys
Languages Studied: English
Submission Number: 6402