Abstract: Jailbreaking research has been valuable for testing and understanding the safety and security issues of large language models (LLMs). In this paper, we introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach that leverages the reflective capabilities of LLMs for self-jailbreaking using only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and the target. The method first iteratively refines adversarial prompts through self-explanation, which is crucial for ensuring that well-aligned LLMs adhere to adversarial instructions. IRIS then rates and enhances the output to increase its harmfulness. We find that IRIS achieves jailbreak success rates of 98% on GPT-4 and 92% on GPT-4 Turbo in under 7 queries, significantly outperforming prior approaches while requiring substantially fewer queries, thereby establishing a new standard for interpretable jailbreaking methods.
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: red teaming, security and privacy, prompting
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 447