Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Prompts

Prarabdh Shukla; Ritik; Suhas Devraj Rao; Arpit Agarwal; Arjun Bhagoji

Published: 03 Jun 2026, Last Modified: 03 Jun 2026 | AI4GOOD Workshop 2026 Regular | Everyone | CC BY 4.0

Keywords: jailbreaking, AI safety, online learning, bandit algorithms

TL;DR: A novel approach to red-teaming using automatically enhanced prompts and bandit algorithms.

Abstract: With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") may be able to obtain actionable responses to malicious requests. This work examines the validity of such concerns. To carry out an attack effectively, a non-expert malicious actor needs to know both the most effective jailbreak for their target model and an effective malicious prompt. For the former, we propose a novel bandit-based attack strategy to efficiently $\text{\textit{learn}}$ the optimal jailbreak from a large choice set by exploration on a (possibly noisy) exploration set of prompts, with subsequent application of the learnt policy on a high-quality $\text{\textit{exploitation set}}$. As for the latter, we curate $\mathrm{FrankensteinBench}$, a safety benchmark of $11{,}279$ malicious prompts sourced via manual curation and from seven existing safety benchmarks. $\mathrm{FrankensteinBench}$ categorizes prompts as either $\text{\textit{simple}}$ or $\text{\textit{complex}}$ based on the level of technical expertise required to craft them. Our findings justify these fears: on average across a diverse set of models, $\text{\textit{complex}}$ prompts increase the attack success rate by $12\%$, and our bandit-based attack achieves success rates as high as $97\%$ on average over $15$ state-of-the-art open-weight LLMs.

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 346