Keywords: LLM Safety, Jailbreak Attacks, Refusal Bypass, Prompt-based Alignment, Adversarial Prompting, Content Moderation, Model Compliance, Safety Evaluation
Abstract: Despite advances in alignment (e.g., RLHF), large language models remain vulnerable to black-box jailbreaks. Many existing attacks rely on prompt obfuscation or iterative search, which can be costly and conspicuous. We propose Sequential-Compliance Prompting (SCP), a three-phase jailbreak framework built around multiple-choice interactions: it first elicits harmless cooperation, then induces an explicit choice of output schema, and finally escalates by appealing to that self-selected commitment. SCP keeps the original toxic request verbatim and exploits answer-schema obedience without per-instance optimization, suffix search, or gradient access. On HarmBench, SCP achieves a 98.3% attack success rate on GPT-4o, outperforming prior black-box baselines under our evaluation protocol. These results identify MCQ-style forced-choice prompting as an underexplored attack surface and motivate defenses that account for structural, not just lexical, manipulation.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment; prompting; robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 9166