The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: Our work proposes jailbreak utility as an important new metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks.
Abstract: Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually *useful*. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the *jailbreak tax*. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes jailbreak utility as an important new metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax.
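For concreteness, the sketch below shows one way the jailbreak tax could be computed on such a ground-truth benchmark: compare the accuracy of the original (unaligned) model with the accuracy of the jailbroken, refusal-aligned model, and report the relative drop. This is a minimal illustration, not the released code; the helper names (`answer_with_base_model`, `answer_with_jailbreak`, `questions`, `ground_truth`) are hypothetical placeholders.

```python
# Minimal sketch (assumed interfaces, not the authors' official implementation):
# compute the "jailbreak tax" as the relative accuracy drop between the base
# model and the jailbroken aligned model on questions with known answers.

def accuracy(predictions, ground_truth):
    """Fraction of answers that match the known correct answers."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

def jailbreak_tax(questions, ground_truth, answer_with_base_model, answer_with_jailbreak):
    """Relative utility drop caused by the jailbreak.

    0.0 means jailbroken answers are as accurate as the base model's;
    0.92 would correspond to the 92% accuracy drop reported as the worst case.
    """
    base_acc = accuracy([answer_with_base_model(q) for q in questions], ground_truth)
    jb_acc = accuracy([answer_with_jailbreak(q) for q in questions], ground_truth)
    return (base_acc - jb_acc) / base_acc if base_acc > 0 else 0.0
```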
Lay Summary: Large language models (like ChatGPT) are designed to refuse certain questions, such as instructions for making a bomb, because they follow built-in safety rules. However, people have found ways to trick models into giving such dangerous answers anyway; this kind of attack is called a jailbreak. In this work, we ask an important question: even if a jailbreak gets around the safety filters, is the answer it produces actually useful? For example, are the bomb-making instructions it yields helpful and correct? To find out, we made models refuse to answer safe questions, like math or science, and then used jailbreaks to force them to respond. Since we already knew the right answers, we could check how useful the replies were. We discovered that jailbreaks often make the model give much worse answers, sometimes ones that are almost completely wrong. We call this the jailbreak tax: the price you pay in quality when you break the rules. So even if an attack makes the chatbot talk, the answer might not be very helpful.
Link To Code: https://github.com/ethz-spylab/jailbreak-tax
Primary Area: Social Aspects->Safety
Keywords: large language models, LLMs, jailbreaks, benchmark, utility
Submission Number: 7905