Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Prompts
Keywords: jailbreaking, AI safety, online learning, bandit algorithms
TL;DR: A novel approach to red-teaming using automatically enhanced prompts and bandit algorithms.
Abstract: With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") may be able to obtain actionable responses to malicious requests. This work examines the validity of such concerns. To carry out an attack effectively, a non-expert malicious actor needs to know both the most effective jailbreak for their target model and an effective malicious prompt. For the former, we propose a novel bandit-based attack strategy that efficiently $\text{\textit{learns}}$ the optimal jailbreak from a large choice set by exploring on a (possibly noisy) $\text{\textit{exploration set}}$ of prompts, and then applies the learnt policy to a high-quality $\text{\textit{exploitation set}}$. For the latter, we curate $\mathrm{FrankensteinBench}$, a safety benchmark of $11{,}279$ malicious prompts sourced via manual curation and from seven existing safety benchmarks. $\mathrm{FrankensteinBench}$ categorizes prompts as either $\text{\textit{simple}}$ or $\text{\textit{complex}}$ based on the level of technical expertise required to craft them. Our findings justify these fears: on average across a diverse set of models, $\text{\textit{complex}}$ prompts increase the attack success rate by $12\%$, and our bandit-based attack achieves success rates as high as $97\%$ on average over $15$ state-of-the-art open-weight LLMs.
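The explore-then-apply structure described in the abstract can be illustrated with a generic multi-armed bandit sketch. The abstract does not say which bandit algorithm the authors use, so UCB1 is an assumption here, as are all names (`ucb1_explore`, `best_arm`, `simulated_pull`) and the simulated success rates. Each arm stands for one candidate strategy from the choice set, and each pull returns a binary, possibly noisy, success signal. This is a toy simulation under those assumptions, not the paper's method.

```python
import math
import random


def ucb1_explore(pull, n_arms, budget):
    """Run UCB1 for `budget` pulls (budget >= n_arms).

    `pull(arm)` returns a reward in [0, 1].
    Returns (counts, means): pulls per arm and empirical mean reward per arm.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, budget + 1):
        if t <= n_arms:
            # Initialisation: play every arm once.
            arm = t - 1
        else:
            # Pick the arm with the highest upper confidence bound.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the empirical mean.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def best_arm(means):
    """Learnt policy: commit to the arm with the highest empirical mean."""
    return max(range(len(means)), key=lambda a: means[a])


# Simulated exploration phase with hypothetical per-arm success rates.
rng = random.Random(0)
true_rates = [0.2, 0.5, 0.8]


def simulated_pull(arm):
    # Bernoulli reward standing in for a noisy success judgment.
    return 1.0 if rng.random() < true_rates[arm] else 0.0


counts, means = ucb1_explore(simulated_pull, len(true_rates), budget=2000)
chosen = best_arm(means)  # arm that would be applied in the exploitation phase
```

The design point the sketch is meant to show is the separation of phases: rewards observed during exploration may be noisy, and only the final arm choice carries over to the exploitation set.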
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 346