Abstract: Jailbreaking research has been valuable for testing and understanding the safety and security issues of large language models (LLMs). In this paper, we introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach that leverages the reflective capabilities of LLMs for self-jailbreaking using only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and the target. The method first iteratively refines adversarial prompts through self-explanation, which is crucial for ensuring that well-aligned LLMs adhere to adversarial instructions. IRIS then rates and enhances the output to increase its harmfulness. We find that IRIS achieves jailbreak success rates of 98% on GPT-4 and 92% on GPT-4 Turbo in under 7 queries, significantly outperforming prior approaches while requiring substantially fewer queries, thereby establishing a new standard for interpretable jailbreaking methods.
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: red teaming, security and privacy, prompting
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 447